AFRL-SR-BL-TR-98- 

REPORT DOCUMENTATION PAGE



2.  REPORT  DATE 
1 August 1997


4.  TITLE  AND  SUBTITLE 

Mathematical  Theory  of  Neural  Networks 


1.  AGENCY  USE  ONLY  {Leave 
Blank) 


AUTHORS 
Eduardo  Sontag 
Hector  Sussmann 


3.  REPORT  TYPE  AND  DATES  COVERED 

Final Technical Report


5.  FUNDING  NUMBERS 


7.  PERFORMING  ORGANIZATION  NAME(S)  AND  ADDRESS(ES) 
Rutgers,  The  State  University  of  New  Jersey 
New  Brunswick,  NJ  08903 


8.  PERFORMING  ORGANIZATION  REPORT 
NUMBER 


9.  SPONSORING  /  MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 
Air  Force  Office  of  Scientific  Research 

AFOSR/NM, Building 410

Bolling AFB, DC 20332-6448


10.  SPONSORING  /  MONITORING  AGENCY 
REPORT  NUMBER 


12a.  DISTRIBUTION  /  AVAILABILITY  STATEMENT 
Approved  for  public  release 
distribution  unlimited 


12b.  DISTRIBUTION  CODE 


13. ABSTRACT (Maximum 200 words)

This  report  focuses  on  fundamental  theoretical  issues  relevant  to  the 
capabilities, performance, and limitations of artificial neural networks.

For static (feedforward) networks, subjects of investigation included the
study  of  error  surfaces  for  least  squares  fitting,  VC  and  other  learning 
dimensions,  representability  questions,  and  function  approximation. 

For  dynamic  (recurrent)  nets,  covered  are  questions  dealing  with  parameter 
identification  and  modeling,  realizability  and  other  systems-theoretic  issues, 
theoretical  computational  capabilities,  and  learning-theoretic  issues. 


14. SUBJECT TERMS
neural networks, learning, systems theory

15. NUMBER OF PAGES
39

16. PRICE CODE

17. SECURITY CLASSIFICATION OF REPORT
UNCLASSIFIED

18. SECURITY CLASSIFICATION OF THIS PAGE
UNCLASSIFIED

19. SECURITY CLASSIFICATION OF ABSTRACT
UNCLASSIFIED

20. LIMITATION OF ABSTRACT

NSN  7540-01-280-5500 


Standard  Form  298  (Rev.  2-89) 
Prescribed  by  ANSI  Std.  Z39-1 
298-102 


19980511  074 


MATHEMATICAL  THEORY  OF  NEURAL  NETWORKS 


Eduardo  D.  Sontag  and  Hector  J.  Sussmann 
SYCON  -  Rutgers  Center  for  Systems  and  Control 
Department  of  Mathematics,  Rutgers  University 
New  Brunswick,  NJ  08903 


ABSTRACT 

This  report  focuses  on  fundamental  theoretical  issues  relevant  to  the  capabilities,  performance, 
and limitations of artificial neural networks.

For static (feedforward) networks, subjects of investigation included the study of error surfaces
for least squares fitting, VC and other learning dimensions, representability questions, and
function approximation. For dynamic (recurrent) nets, covered are questions dealing with parameter
identification and modeling, realizability and other systems-theoretic issues, theoretical
computational capabilities, and learning-theoretic issues.



1  Introduction 

This project focused on theoretical issues regarding capabilities, performance, and limitations of
“neural  computation.”  (Artificial)  neural  networks  are  structures  inspired  by  the  way  biological 
nervous  systems  process  information.  They  consist  of  a  large  number  of  highly  interconnected 
relatively  simple  processing  elements  or  “neurons”  whose  collective  behavior  is  interpreted  as  a 
computation, typically involving tasks of classification, function evaluation, systems simulation,
or  control.  Most  often,  neural  networks  are  designed  by  numerical  parameter-fitting  routines, 
such  as  gradient  descent,  which  are  in  part  appealing  because  of  their  analogies  to  “learning”  in 
biological  systems  by  means  of  adjustments  to  (excitatory  or  inhibitory)  synaptic  connections. 
Much  research  carried  out  by  AFOSR  contractors  and  Air  Force  labs  concerns  the  use  of  neural 
network techniques when dealing, among others, with problems of fault detection and classification
(in particular, for reconfigurable aircraft), design of controls valid over large flight envelopes,
and  precision  laying  of  composites  on  aircraft  structures.  Unquestionably,  there  is  a  demand  for 
basic  theoretical  foundations  for  the  analysis  and  comparison  of  different  models  and  algorithms. 
This  work  concerns  the  development  of  such  foundations. 

The area of neural computing has seen several periods of high interest and expectations,
since at least the late 1940s, separated by periods of reaction to exaggerated claims by some
of its proponents. During the times of less activity, there was nonetheless a continued research
effort by major researchers such as Kohonen and Grossberg, but the latest resurgence
in interest was due in large part to two independent developments, namely the introduction
and study of:

1. classes of functions defined by feedforward nets or multilayer perceptrons with sigmoidal
activations, which are dense (in various spaces of real-valued vector functions), could in
principle be fit to data through nonlinear optimization, and may be evaluated by
finely-grained parallel computation, and

2. classes of dynamical systems called feedback or recurrent nets, which also possess
approximation properties (now with regard to classes of dynamical systems), parametric forms
suitable for optimization, and parallel computation characteristics.

The first line of work was popularized by the Rumelhart and McClelland “PDP” books, while the
second  was  heavily  influenced  by  a  technique  for  associative  memory  storage  and  retrieval  due 
to  Hopfield.  The  catalyst  for  both  was  the  contemporaneous  availability  of  low-cost  computing 
power  (for  simulations  and  implementations)  made  possible  by  microprocessor  technology,  in 
amounts  which  were  unimaginable  when  similar  ideas  had  been  considered  in  the  past. 

1.1  Main  Motivation  for  This  Work 

Many practical successes of the associated technologies have been claimed in both the engineering
and popular press. A fair amount of research exists by now on theoretical issues for neural
networks, though papers are scattered in various journals. The “NeuroCOLT” project in Europe
is one excellent source of material which is available online, as well as references. (See also the first
PI’s Web pages, and references later and in cited papers, for further pointers to the literature.)
There are several textbooks dealing with theoretical issues concerning neural networks, such as
the one recently published by Vidyasagar, but a great many basic questions are still poorly
understood.

1.2  Another Motivation

There is another motivation as well which underlies the work reported here. This arises from
an issue that has come up in many domains (including computer, communications, and control
engineering) concerning the interface between, on the one hand, the continuous, physical world
and, on the other hand, discrete devices such as digital computers, capable of symbolic processing.
Lately the term “hybrid” systems has become popular in this context.

For instance, classical control techniques, especially for linear systems, have proved spectacularly
successful in automatically regulating relatively simple systems; however, for large-scale
problems, controllers resulting from the application of the well-developed theory are used as building
blocks of more complex systems. The integration of these systems is often accomplished by
means of ad-hoc techniques that combine pattern recognition devices, various types of switching
controllers, and humans (or, more recently, expert systems) in supervisory capabilities. This has
caused renewed interest in the formulation of mathematical models in which the interface between
the continuous and the symbolic is naturally accomplished and system-theoretic questions can be
formulated and resolved for the resulting models. Successful approaches will eventually allow the
interplay of modern control theory with automata theory and other techniques from computer
science. This interest has motivated much research into areas such as discrete-event systems,
supervisory control, and more generally “intelligent control systems”. As an illustration of the
interest level this subject attracts, suffice it to say that a recent workshop at Rutgers, co-sponsored
and cofunded by the PIs’ Center for Systems and Control (SYCON), was attended by well over a
hundred researchers in hybrid systems (only a subset of those who applied, being constrained by
space and funding limitations), and resulted in a lively exchange of ideas, reflected in the extensive
proceedings volume [6]. (The PIs are actively involved in organization activities for further
hybrid systems meetings, including membership in the Grenoble 1997 program committee.)

The  present  work  has  as  one  of  its  underlying  objectives  the  analysis  of  neural  networks  as  a 
paradigm in which to understand hybrid systems issues. Other paradigms could be used as well;
it so turns out that neural nets appear to be a particularly appealing and very mathematically
natural  class  of  nonlinear  systems,  as  will  be  explained  in  the  report. 

Similarly, in numerical analysis and optimization, much activity has taken place in the realm
of continuous algorithms. Included here are such areas as differential-equation implementations
of “interior methods” for linear and nonlinear programming, the use of flows on manifolds to
solve eigenvalue and optimization problems (in particular the work of Brockett and his school),
and the Blum-Shub-Smale approach to “real-valued” algorithms. All these deal in one way or
another with the power of “analog computing” to solve problems that can also be attacked with
discrete/symbolic techniques. Again, we view the neural net paradigm as one in which to explore
the interface between “digital” and “analog” modes of computation. In fact, work by one of the
PIs and one of his students has succeeded in formulating a new computer-science approach to
these issues and has been reported upon to the general scientific community (Science, April 28,
1995).

1.3  Neural  Networks 

The  term  “neural  network”  tends  to  be  applied  loosely,  making  the  area  too  broad  and  ill-defined. 
For  a  mathematical  study,  especially  when  providing  comparative  results,  precise  definitions 
are required. For purposes of this report, therefore, we adopt the most popular paradigm:
(artificial) neural nets are systems composed of saturation-type nonlinearities, linear elements,
and optionally dynamic components (integrators in continuous time, delays in discrete time).
Restricting attention to these still leaves a huge subject, but the scope leaves out a variety of
topics sometimes included, such as “radial basis” networks or multiplicative effects (high-order
networks); many of our results and reported ideas do apply to such more general classes as well
(for instance, those on VC dimension, the estimates on classification of sets in general position,
results on impossibility of feedback control when using certain architectures, and so forth), but
by narrowing down the area we can focus our discussion better. There are two basic types of
networks: feedforward and feedback, introduced next.

Feedforward  Networks 

This model consists of feedback-free interconnections of basic units, or “neurons”. Each unit
operates on some linear combination of the outputs of other units and signals arising from outside
the  network  (the  environment).  The  output  produced  by  each  unit  has  an  intensity  that  is  a 
nonlinear  function  of  this  aggregate  input,  and  it  is  transmitted  to  other  neurons.  A  subset  of 
the  outputs,  or  a  set  of  linear  combinations  of  these  outputs,  represents  the  output  of  the  network 
as  a  response  to  the  inputs  from  the  environment.  In  this  manner,  a  neural  network  determines 
an input/output mapping. When viewing the linear combination coefficients as parameters to be
“tuned”  or  adjusted,  networks  play  the  role  of  parametric  families  of  functions. 

An often-mentioned engineering justification for the use of neural networks lies in their distributed
character, which in principle allows a degree of fault tolerance. Moreover, direct hardware
implementations  of  neural  networks  exploit  their  inherent  parallelism,  and  often  run  orders  of 
magnitude  faster  than  software  simulations.  Most  major  chip  manufacturers  have  offered  neural 
net  chips,  and  various  complete  neurocomputers  are  commercially  available.  As  one  example  of 
direct  AF  interest,  an  8,000-neuron  chip  developed  by  AAC  (Accurate  Automation  Corporation) 
has been reportedly used to control a subscale model of the USAF unmanned 5.5 Mach
“waverider”  (cf.  Aviation  Week  &  Space  Tech,  April  3,  1995,  pp.  78-79). 

Also motivating the use of nets is the belief that in some sense they are an especially appropriate
family of parameterized models, as opposed to, say, finite Fourier series or splines. Numerical
and statistical advantages are said to include excellent capabilities for learning, adaptation, and
generalization. We do not argue the case for or against this belief here. Notwithstanding some
popular claims to the contrary (for instance, supported by the rate of approximation results reviewed
in the report), we feel that the situation is still very unclear regarding the relative merits
of  using  neural  nets.  Nevertheless,  since  techniques  based  on  neural  networks  play  a  role  as  part 
of  the  general  set  of  tools  in  estimation,  learning,  and  control,  a  deeper  understanding  of  the  topic 
is  a  prerequisite  to  theoretical  comparisons. 

Feedback  Nets 

Recurrent  neural  networks  are  those  which  include  dynamic  elements  (delay  lines  or  integrators, 
in discrete and continuous time scales respectively), in contrast to feedforward nets, which only
contain static units. Their behavior is described by means of systems of difference (in discrete
time)  or  differential  (in  continuous  time)  equations. 

Part  of  this  report  focuses  on  such  dynamic  networks,  whose  interest  arises  from  the  many 
applications  in  which  input  data  is  a  function  of  time.  For  example,  the  inputs  fed  into  a 
speech  recognition  system  might  be  sequences  consisting  of  windowed  Fourier  coefficients  and 
signal levels at each instant; as another example, in control problems the inputs to a regulator
might be time-dependent measurements of the plant being controlled as well as the successive
coordinates  of  a  path  to  be  tracked.  It  is  desirable  to  make  use  of  the  information  implicit  in 
the  correlations  and  dependencies  that  exist  among  the  inputs  at  different  times,  and  this  can 
be achieved by considering models which consist of dynamical systems, of which recurrent nets
provide  a  particularly  rich  subclass. 

Recurrent  networks  are  among  the  models  considered  by  Grossberg  and  his  school  during 
the  last  twenty  or  more  years,  and  include  the  networks  proposed  by  Hopfield  for  associative 
memory  and  optimization.  They  have  been  employed  in  the  design  of  control  laws  for  robotic 
manipulators (Jordan), as well as in speech recognition (Fallside, Kuhn), speaker identification
(Anderson), formal language inference (Giles), and sequence extrapolation for time series prediction
(Farmer). In both the areas of signal processing and control, recurrent nets have been
proposed as generic identification models or as prototype dynamic controllers. In addition, as discussed
later, theoretical results about neural networks established their universality as models for
systems  approximation  as  well  as  analog  computing  devices.  Special  purpose  chips  are  being  built 
to  implement  recurrent  nets  directly  in  hardware,  just  as  done  for  feedforward  nets;  for  instance, 
Hitachi’s  Wafer  Scale  Integration  chips  have  been  designed  to  implement  Hopfield  nets  with  over 
500  neurons  and  30,000  synaptic  connections.  Electrical  circuit  implementations  of  recurrent  nets, 
employing resistively connected networks of nonlinear amplifiers, with the resistor characteristics
used to reflect the desired weights, have been suggested as analog computers, in particular for
solving  constrained  optimization  problems  and  for  implementing  content-addressable  memories. 

1.4  Learning  Theory  and  Other  Methodologies 

We wish to study the question: what are special properties of neural networks (as models
for functions or for dynamical systems)? It is extremely difficult to answer directly questions
such as “how do nets compare with Fourier expansions?”; obviously, if the task involves
periodic signals, it would be inappropriate to use neural networks, but if one has data that is
partitioned naturally by affine subspaces, networks may be more appropriate. It is more reasonable
to search for a theoretical understanding of properties of nets according to various accepted
quantifiable criteria, such as rates of approximation, learning-theoretic (VC, Pollard) dimensions,
metric entropy of associated function spaces, interpolation and extrapolation (“generalization”)
measures, pattern classification power, and so forth. This is why we focused much of our work
on questions arising from a learning-theoretic framework that allows the formulation of such
questions. This framework leads to posing mathematical problems whose solution requires tools
from areas as diverse as approximation theory, stochastic processes, or analytic function theory.

Other questions about networks lead to issues of a system-theoretic character, such as deciding
if there are redundant internal states in a feedback net (controllability, observability) or
if parameters can be identified from input/output data. For these problems, tools from control
theory  are  natural. 

1.5  Scope and Organization of the Report

The main subtopics of our research can be organized roughly into these broad categories:

•  Static (“Feedforward”) Nets for Classification and Function Approximation

   -  Critical Points of Error Function
   -  VC Dimension Bounds
   -  Classification of Inputs in General Position
   -  Uses for Implementation of State-Feedback Control
      o  Single vs Multiple Layers
      o  Control of Linear Systems with Saturation: Continuous and Discrete Time
   -  Function Approximation

•  Dynamic (“Feedback” or “Recurrent”) Nets for Systems Modeling

   -  Sample Complexity for Identification
   -  Parameter Identification and System-Theoretic Aspects
   -  Nets as Models of Analog Computing
   -  Hybrid Systems: Piecewise Linear

We  did  not  organize  this  report  precisely  according  to  this  outline  of  scope,  however.  We  prefer 
to  follow  a  more  natural  discussion,  introducing  subjects  according  to  their  motivation  (structure, 
learning, feedback control, etc). Thus, the first part of this report, Section 2, provides a brief
introduction  to  neural  networks,  learning  theory,  construction  of  feedback,  dynamical  systems 
questions,  computability,  and  other  ingredients  of  the  work.  We  attempt  to  give  as  few  technical 
details  as  possible  while  still  conveying  the  main  issues,  results,  and  problems. 

Given  the  range  of  subjects  treated,  and  the  different  methodologies  employed  in  each,  lack  of 
space  prevents  giving  details  on  all  subtopics.  Thus  we  have  chosen  only  a  few  selected  subtopics, 
especially  those  represented  in  very  recent  work  and  hence  possibly  less  easily  available  to  readers, 
to  be  explained  in  somewhat  more  detail,  in  the  last  part,  Section  3.  Overall,  however,  most 
of  the  discussion  in  this  report  is  informal,  with  references  to  the  literature  for  precise  technical 
points.  The  Web  page 

http://www.math.rutgers.edu/~sontag/

provides access to a substantial number of papers by the PIs, and these may be consulted for
many more details and a mathematically rigorous presentation.

2  Overview

In this part of the report, we discuss the model of neural nets, and we explain in general
terms the types of problems considered. The next part has some further technical
details on selected subtopics. As mentioned earlier, the Web page found at the URL
http://www.math.rutgers.edu/~sontag/ provides access to a substantial number of papers
by the PIs, to be consulted for details and a mathematically rigorous presentation.

We  first  discuss  the  model  in  informal  terms. 


Feedforward  (Static)  Nets 


A feedforward neural network is an interconnection of basic processors of a certain particular
form.

[Diagram: inputs, affine combination, output]


As  discussed  in  the  introductory  section,  we  restrict  attention  here  to  the  standard  paradigm  in 
which  each  processor  is  a  “first  order  neuron,”  though  many  of  our  results  apply  to  more  general 
models. A first-order neuron is a device with multiple inputs and a single output channel; it
integrates  its  inputs  according  to  “synaptic”  weights  (positive  values  of  weights  correspond  to 
excitatory  connections  and  negative  values  to  inhibitory  ones,  in  the  biological  analogy),  and,  with 
an  intensity  that  is  a  function  of  the  aggregate  value  obtained  from  this  combination  of  signals,  it 
“fires”  its  response  signal.  The  inputs  to  each  neuron  are,  themselves,  outputs  from  other  neurons 
or  signals  originating  from  outside  the  system  (inputs  to  the  network).  The  output  signal  that  a 
neuron  produces  is  broadcast  to  the  rest  of  the  system.  Some  set  of  linear  combinations  of  signals, 
or  as  a  special  case  just  the  outputs  of  a  designated  set  of  “output  neurons,”  is  interpreted  as 
the  output  signal  produced  by  the  whole  network. 

Mathematically, the output produced by each individual neuron in the network has the form
$\sigma(a_0 + a_1 u_1 + \dots + a_m u_m)$, where $(u_1, \dots, u_m)$ are the inputs to that neuron. The transformation
$\sigma : \mathbb{R} \to \mathbb{R}$ is called the “activation function.” Sometimes one assumes that all the activations
are the same, equal to a fixed $\sigma$; when this happens, we’ll say here that the net is homogeneous.
The coefficients $a_i$ of the affine combination of inputs that is computed by each neuron are often
called “weights” or “programmable parameters” ($a_0$ is sometimes called the “bias”), and are in
general different for each processor in the given network.
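
As a minimal sketch of this formula, a single first-order neuron can be evaluated as below. The function name `neuron_output` and the sample weights are illustrative assumptions, not notation from the report.

```python
import math

def neuron_output(sigma, a0, weights, inputs):
    """Return sigma(a0 + a1*u1 + ... + am*um) for one first-order neuron.

    sigma   : activation function R -> R
    a0      : bias
    weights : coefficients a1, ..., am
    inputs  : input values u1, ..., um
    """
    if len(weights) != len(inputs):
        raise ValueError("need exactly one weight per input")
    # Affine combination of the inputs, then the activation.
    aggregate = a0 + sum(a * u for a, u in zip(weights, inputs))
    return sigma(aggregate)

# Hypothetical example: bias 1, weights (1, 2), inputs (0.5, -0.75), tanh activation.
value = neuron_output(math.tanh, 1.0, [1.0, 2.0], [0.5, -0.75])
```

Here the aggregate input is 1 + 0.5 - 1.5 = 0, so the neuron returns tanh(0) = 0.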

The  interconnection  structure  can  be  formally  specified  in  terms  of  a  graph  whose  nodes  are 
the neurons, or equivalently in matrix theoretic terms. Once this interconnection structure,
the activation functions $\sigma$ for each neuron, and the values of the weights have been fixed, the
behavior  of  the  network,  as  a  device  transforming  external  inputs  to  outputs,  is  well-defined. 

For example, the left part of Figure 1 provides a “block diagram” for a static net which
computes the function

$$y = 2\sigma_4\bigl(3\sigma_1(5u_1 - u_2) + 2\sigma_2(u_1 + 2u_2 + 1) + 1\bigr) + 5\sigma_3\bigl(-3\sigma_2(u_1 + 2u_2 + 1) - 1\bigr).$$

(Ignore both the right diagram and the dotted line for now.) There are four neurons, with
activations $\sigma_i$, $i = 1, 2, 3, 4$, two external inputs $u_1$ and $u_2$, and the scalar output $y$ of the network
is a linear combination of the outputs of two of the neurons; the numbers $2, 3, \dots, -1$ are the
weights of the network.
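
The example network can be evaluated neuron by neuron, as in the following sketch. The report leaves the four activations unspecified, so defaulting them to `math.tanh` is an assumption made purely for illustration, as is the function name `figure1_net`.

```python
import math

def figure1_net(u1, u2, s1=math.tanh, s2=math.tanh, s3=math.tanh, s4=math.tanh):
    """Evaluate y = 2*s4(3*s1(5u1-u2) + 2*s2(u1+2u2+1) + 1) + 5*s3(-3*s2(u1+2u2+1) - 1).

    The default tanh activations are an assumption; pass other functions
    to use different activations.
    """
    z1 = s1(5 * u1 - u2)            # neuron 1 reads the external inputs
    z2 = s2(u1 + 2 * u2 + 1)        # neuron 2 reads the external inputs
    z3 = s3(-3 * z2 - 1)            # neuron 3 reads neuron 2
    z4 = s4(3 * z1 + 2 * z2 + 1)    # neuron 4 reads neurons 1 and 2
    return 2 * z4 + 5 * z3          # output: linear combination of neurons 4 and 3

def identity(x):
    return x

# With identity activations the arithmetic can be checked by hand:
# u = (1, 0) gives z1 = 5, z2 = 2, z3 = -7, z4 = 20, so y = 40 - 35 = 5.
y_check = figure1_net(1, 0, identity, identity, identity, identity)
```

Note that the output of neuron 2 is computed once and reused, which reflects the fact that the same term appears twice in the formula.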


The  Activation  Function 

There are several choices for $\sigma$ that are typical in theory and applications.

Figure 1: Examples of one Feedforward and one Recurrent Net
(left: $y = 2\sigma_4(3\sigma_1(5u_1 - u_2) + 2\sigma_2(u_1 + 2u_2 + 1) + 1) + 5\sigma_3(-3\sigma_2(u_1 + 2u_2 + 1) - 1)$;
right: $\dot{x}_1 = \sigma_1(2x_1 + x_2 - u_1 + u_2)$, $\dot{x}_2 = \sigma_2(-x_2 + 3u_2)$, $y = x_1$)

Figure 2: Different Functions $\sigma$


The first is the “sign” function, $\mathrm{sign}(x) = x/|x|$ (zero for $x = 0$), or its relative, the hardlimiter,
threshold, or Heaviside function $H(x)$, which equals 1 if $x > 0$ and 0 for $x \le 0$ (in either case, one
could define the value at zero differently; results do not change much). In order to apply various
numerical techniques, one often needs a differentiable $\sigma$ that somehow approximates $\mathrm{sign}(x)$ or
$H(x)$. For this, it is customary to consider the hyperbolic tangent $\tanh(x)$, which is close to the
sign function when the “gain” $\gamma$ is large in $\tanh(\gamma x)$. Equivalently, up to translations and change
of coordinates, one may use the standard sigmoid

$$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (1)$$

Also common in practice is a piecewise linear function,

$$\sigma(x) = \begin{cases} -1 & \text{if } x < -1 \\ 1 & \text{if } x > 1 \\ x & \text{otherwise.} \end{cases} \qquad (2)$$

This  is  sometimes  called  a  “semilinear”  or  “saturated  linearity”  function. 
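
The activation functions just listed can be written down directly, as in the sketch below. The function names are illustrative assumptions, and the values chosen at zero are one convention among several. The last lines check numerically the standard identity tanh(x) = 2s(2x) - 1 relating the hyperbolic tangent to the standard sigmoid s, which is one way to read "up to translations and change of coordinates."

```python
import math

def sign(x):
    """Sign function x/|x|, with the convention sign(0) = 0."""
    return 0.0 if x == 0 else x / abs(x)

def heaviside(x):
    """Hardlimiter: 1 for x > 0, 0 otherwise (the value at 0 is a convention)."""
    return 1.0 if x > 0 else 0.0

def tanh_with_gain(x, gain):
    """tanh(gain * x); approaches sign(x) as the gain grows."""
    return math.tanh(gain * x)

def standard_sigmoid(x):
    """Standard sigmoid 1 / (1 + exp(-x)), equation (1)."""
    return 1.0 / (1.0 + math.exp(-x))

def saturated_linearity(x):
    """Piecewise linear function of equation (2): clip x to [-1, 1]."""
    return max(-1.0, min(1.0, x))

# Numerical check of tanh(x) = 2 * standard_sigmoid(2x) - 1 at a sample point.
x = 0.3
identity_gap = abs(math.tanh(x) - (2.0 * standard_sigmoid(2.0 * x) - 1.0))
```

For large gains, `tanh_with_gain(x, gain)` is numerically indistinguishable from `sign(x)` away from zero, which is the sense in which it approximates the sign function.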

Feedback  (Dynamic)  Networks 

When  studying  feedforward  nets,  there  is  no  need  to  introduce  an  explicit  concept  of  time. 
Outputs  may  be  understood  either  as  being  produced  instantaneously,  or  after  a  fixed  processing 
delay,  but  mathematically  it  makes  no  difference:  in  either  case,  the  behavior  of  the  network  can  be 
interpreted  as  a  mapping  from  input  values  to  output  values.  We  always  think  of  feedforward  nets 
as  computing  functions  between  finite-dimensional  input  and  output  spaces.  As  an  illustration, 
the network that was shown in the first example in Figure 1 induces a mapping $\mathbb{R}^2 \to \mathbb{R}$.

When interconnection graphs have cycles, in contrast, it is in general impossible to define the
behavior of a network solely in static terms, since there may arise algebraic inconsistencies. This
is illustrated by this simple network:

[Diagram: a single neuron with activation $\sigma$ whose inputs are the external input $u$ and its own output $y$]

with for instance $\sigma(x) = x$, for which the output $y$ is undefined if $u$ is a nonzero input (the
network would require $y = y + u$). One
must  explicitly  introduce  a  processing  delay  and  understand  the  behavior  as  an  iteration,  starting 
from  some  initial  state.  In  graphical  terms,  a  way  to  represent  this  situation  is  by  adding  to  the 
definition  of  networks,  besides  neurons  as  described  earlier,  also  dynamic  elements,  namely  unit 
delays. 

For instance, in this example, if the initial value of $y$ is $y_0$, then after one step one has $y$ equal
to $\sigma(y_0 + u(0))$, where $u(0)$ is the input at “time 0”, after two steps one has $\sigma(\sigma(y_0 + u(0)) + u(1))$,
and  so  forth.  In  this  manner,  the  behavior  of  a  network  is  described  by  a  system  of  difference 
equations  —  one  equation  for  the  state  of  each  neuron  —  which  are  forced  by  the  external  inputs. 
One  obtains  discrete-time  systems  in  the  sense  standard  in  system  theory  (see  e.g.  [5]). 
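
The iteration described above can be sketched in a few lines. The function name `iterate_unit_delay` is an illustrative assumption; the recursion y(t+1) = sigma(y(t) + u(t)) is the one read off from the one-step and two-step expressions in the text.

```python
def iterate_unit_delay(sigma, y0, inputs):
    """Return [y(0), y(1), ..., y(N)] for y(t+1) = sigma(y(t) + u(t)).

    sigma  : activation function R -> R
    y0     : initial state of the delayed neuron
    inputs : sequence u(0), ..., u(N-1) of external inputs
    """
    trajectory = [y0]
    for u in inputs:
        # One unit delay: the next state depends on the current state and input.
        trajectory.append(sigma(trajectory[-1] + u))
    return trajectory

# With sigma(x) = x and constant input 1, the static equation y = y + 1 has no
# solution, but the delayed network is well defined and simply accumulates.
states = iterate_unit_delay(lambda x: x, 0.0, [1.0, 1.0, 1.0])
```

The resulting states are 0, 1, 2, 3: the algebraic inconsistency of the static loop becomes an ordinary difference equation once the delay is made explicit.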

An  alternative  way  to  model  feedback  nets  is  to  use  a  continuous  time  scale  and,  accordingly, 
systems  of  differential  instead  of  difference  equations.  This  alternative  corresponds  to  thinking  of 
neurons  as  capacitors.  It  can  be  represented  by  using  integrators  as  dynamic  elements,  and  gives 
rise  to  continuous-time  systems  in  the  sense  of  system  theory.  As  a  simple  example,  consider  the 
right  part  of  Figure  1.  The  behavior  of  this  net  is  described  by  the  following  system  of  differential 
equations: 


$$\dot{x}_1(t) = \sigma_1\bigl(2x_1(t) + x_2(t) - u_1(t) + u_2(t)\bigr) \qquad (3a)$$

$$\dot{x}_2(t) = \sigma_2\bigl(-x_2(t) + 3u_2(t)\bigr) \qquad (3b)$$

$$y(t) = x_1(t), \qquad (3c)$$

where the dot indicates derivative with respect to time, $\sigma_1$ and $\sigma_2$ are two activation functions,
the input to the system is given by the vector $u(t) = (u_1(t), u_2(t))$, and the output at time $t$ is
the current value of $x_1$. (We usually omit arguments “$t$”.)

One speaks of (discrete or continuous time) feedback, recurrent, or dynamic nets when dynamic
elements are included.

One way to associate an input/output behavior to a dynamic net, given a specified set of initial conditions, is by looking at the last output that results after a finite-length input has been presented. For instance, in the above example, suppose that we fixed the initial state as $x_1(0) = x_2(0) = 0$. Then, given any input $u(t) = (u_1(t), u_2(t))$ defined for $t \in [0,T]$, we may solve the system of differential equations (3) with $u(\cdot)$ substituted into the right-hand side. (More precisely, one should assume, for instance, that $\sigma$ satisfies a global Lipschitz continuity condition, so that solutions are indeed defined for all $t \in [0,T]$, and that the input components are in $L^\infty(0,\infty)$.) In this way we obtain a state trajectory $(x_1(t), x_2(t))$. We may then read out the output $y(T) = x_1(T)$ and interpret it as the output of the network in response to the forcing function $u$.
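The read-out procedure just described can be sketched numerically. The following is a minimal illustration, not part of the original development: it assumes that both activations are tanh (which is globally Lipschitz) and uses a plain forward-Euler discretization; the function name and the step count are hypothetical choices.

```python
import math

def simulate_example_net(u, T, steps=10_000, sigma1=math.tanh, sigma2=math.tanh):
    """Forward-Euler integration of system (3) from x1(0) = x2(0) = 0.

    u maps a time t in [0, T] to a pair (u1, u2); the return value
    approximates the read-out y(T) = x1(T).
    """
    x1, x2 = 0.0, 0.0
    dt = T / steps
    for k in range(steps):
        u1, u2 = u(k * dt)
        dx1 = sigma1(2 * x1 + x2 - u1 + u2)   # right-hand side of (3a)
        dx2 = sigma2(-x2 + 3 * u2)            # right-hand side of (3b)
        x1, x2 = x1 + dt * dx1, x2 + dt * dx2
    return x1                                  # read-out (3c)
```

For instance, `simulate_example_net(lambda t: (1.0, 0.0), T=1.0)` approximates the response to a constant input on $[0,1]$. A more careful treatment would use an adaptive ODE solver.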

2.1  Definitions:  Feedforward  Nets 
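By an $(m,n)$-layer we mean a triple

$$(A, a, \sigma) \tag{4}$$

consisting of a matrix $A \in \mathbb{R}^{n\times m}$, a vector $a \in \mathbb{R}^n$, and a diagonal mapping

$$\sigma : \mathbb{R}^n \to \mathbb{R}^n : \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} \mapsto \begin{pmatrix} \sigma_1(x_1) \\ \vdots \\ \sigma_n(x_n) \end{pmatrix} \tag{5}$$

where $\sigma_1, \dots, \sigma_n$ are maps $\mathbb{R} \to \mathbb{R}$. The mapping induced by the layer $\Sigma = (A, a, \sigma)$ is the function

$$\varphi_\Sigma : \mathbb{R}^m \to \mathbb{R}^n : u \mapsto \sigma(Au + a). \tag{6}$$

The component maps $\sigma_1, \dots, \sigma_n$ are the activations of the layer. If it is the case that all the $\sigma_i$ are equal to a fixed function $\sigma$, we say that the layer is homogeneous with activation $\sigma$ and write also [...] instead of $\sigma$.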

Let $k$ be a positive integer. A $k$-hidden layer net is a $(k+1)$-tuple $(\Sigma^1, \dots, \Sigma^{k+1})$ of layers, where $\Sigma^{k+1}$ is homogeneous with identity activation $\sigma(x) = x$. The mapping induced by the net $(\Sigma^1, \dots, \Sigma^{k+1})$, where each $\Sigma^i$ is an $(n_{i-1}, n_i)$-layer, is the composition of the mappings induced by its layers,

$$\varphi_{\Sigma^{k+1}} \circ \cdots \circ \varphi_{\Sigma^1} : \mathbb{R}^m \to \mathbb{R}^p,$$

where $m = n_0$ and $p = n_{k+1}$. The integer $n := n_1 + \dots + n_k$ is the number of neurons of the net. The spaces $\mathbb{R}^m$ and $\mathbb{R}^p$ are called, respectively, the input and output spaces of the net. (The layers $\Sigma^1, \dots, \Sigma^k$ are often called hidden layers to reflect the fact that the signals in the intermediate spaces are not part of inputs or outputs and in that sense they are “hidden” from the external environment.) The activations of the net are the activations of the layers $\Sigma^1, \dots, \Sigma^k$. We say that the net is homogeneous with activation $\sigma$ if all the activations of the layers $\Sigma^i$, $i = 1, \dots, k$, are equal to the fixed function $\sigma$. Obviously, a vector mapping $f : \mathbb{R}^m \to \mathbb{R}^p$ is induced by some net with activations belonging to a given set $\mathcal{S}$ if and only if each of its components $f_j : \mathbb{R}^m \to \mathbb{R}$ is induced by some net with activations from this set $\mathcal{S}$. Thus one often studies only scalar-output nets.
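The definitions of layer and net translate directly into code. The sketch below is only an illustration under stated assumptions: a layer is stored as a tuple `(A, a, sigmas)` with `A` a list of rows, the helper names `layer_map` and `net_map` are hypothetical, and the numerical weights are arbitrary.

```python
import math

def layer_map(A, a, sigmas):
    """Mapping u -> sigma(A u + a) induced by the layer (A, a, sigma)."""
    def phi(u):
        pre = [sum(A_ij * u_j for A_ij, u_j in zip(row, u)) + a_i
               for row, a_i in zip(A, a)]
        return [s(z) for s, z in zip(sigmas, pre)]
    return phi

def net_map(layers):
    """Composition of the layer maps, first layer applied first."""
    phis = [layer_map(A, a, sigmas) for (A, a, sigmas) in layers]
    def f(u):
        for phi in phis:
            u = phi(u)
        return u
    return f

identity = lambda x: x

# A 1-hidden layer net R^2 -> R with two tanh neurons (arbitrary weights).
hidden = ([[1.0, -1.0], [0.5, 2.0]], [0.0, 1.0], [math.tanh, math.tanh])
output = ([[2.0, 5.0]], [0.0], [identity])   # last layer: identity activation
f = net_map([hidden, output])
```

Here `f([0.0, 0.0])` returns a one-element list, since the output space is $\mathbb{R}^1$.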

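As an illustration, take again the diagram in the left part of Figure 1, ignoring both the right diagram and the dotted line for now. This is a pictorial representation of a 2-hidden layer net whose $(2,2)$-layer $\Sigma^1$ is defined by

$$A = \begin{pmatrix} 5 & -1 \\ 1 & 2 \end{pmatrix}, \quad a = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \quad \sigma = \begin{pmatrix} \sigma_1 \\ \sigma_2 \end{pmatrix},$$

while the $(2,2)$-layer $\Sigma^2$ is defined by

$$A = \begin{pmatrix} 3 & 2 \\ 0 & -3 \end{pmatrix}, \quad a = \begin{pmatrix} 1 \\ [\ldots] \end{pmatrix}, \quad \sigma = \begin{pmatrix} \sigma_4 \\ \sigma_3 \end{pmatrix},$$

and the $(2,1)$-layer $\Sigma^3$ has $A = (2 \;\; 5)$ and $a = 0$. The mapping induced by this net is

$$\mathbb{R}^2 \to \mathbb{R} : u \mapsto 2\sigma_4\bigl(3\sigma_1(5u_1 - u_2) + 2\sigma_2(u_1 + 2u_2 + 1) + 1\bigr) + 5\sigma_3\bigl(-3\sigma_2(u_1 + 2u_2 + 1) - [\ldots]\bigr).$$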

2.1.1  Particular  Case:  Single-Hidden  Layer  Nets 

A particular case, which appears often in the discussions to follow, is that of homogeneous, scalar-output, single hidden layer nets (just 1HL from now on). A function $f$ can be expressed as the input/output mapping induced by a 1HL net with activation $\sigma$ if and only if $f$ belongs to the affine span of the (multivariate) dilations and translates of $\sigma$:

$$f(u) = c_0 + \sum_{i=1}^{n} c_i\, \sigma(A_i u + a_i) = C\, \sigma(Au + a) + c_0, \tag{7}$$

where we have taken $\Sigma^1 = (A, a, \sigma)$ and $\Sigma^2 = (C, c_0, \mathrm{id})$, with $\sigma$ applied coordinatewise in the last expression, and we are denoting by $A_i$ the $i$th row of the matrix $A$, $a = \mathrm{col}(a_1, \dots, a_n)$, and $C = (c_1, \dots, c_n)$. Letting $A = (a_{ij})$ and $u = \mathrm{col}(u_1, \dots, u_m)$, the corresponding 1HL net may be represented diagrammatically as in Figure 3.


Figure 3: Homogeneous Scalar-output One Hidden Layer (1HL) Net
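Formula (7) can be evaluated term by term, as in the following sketch. The choice of tanh as the activation, the function name `one_hidden_layer`, and all numerical weights are assumptions made only for illustration.

```python
import math

def one_hidden_layer(A, a, C, c0, sigma=math.tanh):
    """Return f(u) = c0 + sum_i C[i] * sigma(A[i] . u + a[i]), as in (7)."""
    def f(u):
        total = c0
        for A_i, a_i, c_i in zip(A, a, C):
            total += c_i * sigma(sum(w * x for w, x in zip(A_i, u)) + a_i)
        return total
    return f

# n = 3 hidden neurons, m = 2 inputs; all numbers are arbitrary illustrations.
f = one_hidden_layer(A=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                     a=[0.0, 0.0, -1.0],
                     C=[1.0, -2.0, 0.5],
                     c0=0.25)
```

Each row `A[i]` together with `a[i]` determines one dilation and translate of the activation, and `C`, `c0` form the affine combination.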

The interest in 1HL nets, besides their simplicity and their resemblance to related mathematical constructs (functional expansions of various types, wavelets, etc.), is due in large part to the density properties of functions computed by 1HL nets with respect to various spaces (continuous functions, $L^p$ spaces with $p < \infty$, Sobolev spaces under additional differentiability hypotheses). It is by now well known, thanks to the pioneering work of Cybenko, Hornik, White, Leshno, and others, that 1HL networks have these universal approximation properties, assuming very little on the activation $\sigma$ besides its not being polynomial (for simple proofs, which apply to restricted yet quite general activations, see [32]).

Of course, one should not read too much into this universality property, since better rates of convergence in function approximation, or better classifiers for pattern recognition, or better feedback control laws, may be obtainable with more complex networks than 1HL ones. (As an analogy, polynomials are dense in spaces of continuous functions but they are not necessarily the right choice in approximation problems, where other families of functions, such as splines, are often preferred.) This has been remarked often in the literature; see for instance [10].

In fact, two-hidden layer nets turn out to be more appropriate, from a theoretical standpoint, for certain problems. For instance, the characteristic function of a square in the plane can be easily approximated by the output of a homogeneous two-hidden layer net (with various types of $\sigma$), but good approximations using 1HL nets require many terms (no matter what $\sigma$). A basic problem is that uniform approximation of discontinuous functions is in general impossible using 1HL nets; technically, there is no density on $L^\infty$, even on compact subsets; as a matter of fact, an even stronger negative result holds regarding the impossibility of constructing sections of certain coverings, as discussed in [10]. This has serious implications in solving inverse problems, such as inverse kinematics computations in robotics, and indeed has been noted experimentally by many authors. The same problem appears in control applications, as explained later.
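To make the remark about the square concrete, here is one possible two-hidden layer construction. It is a sketch under assumptions not stated in the text: the activation is the Heaviside step with the convention $H(0) = 1$, the square is $[0,1] \times [0,1]$, and with this discontinuous activation the indicator is reproduced exactly rather than approximated.

```python
def heaviside(x):
    """Step activation with the convention H(0) = 1 (an assumption)."""
    return 1.0 if x >= 0 else 0.0

def square_indicator(u1, u2):
    """Two-hidden layer net computing the indicator of [0, 1] x [0, 1]."""
    # First hidden layer: one neuron per side of the square.
    h = [heaviside(u1), heaviside(1.0 - u1), heaviside(u2), heaviside(1.0 - u2)]
    # Second hidden layer: fires only when all four half-plane tests pass.
    g = heaviside(sum(h) - 3.5)
    # Output layer: identity activation.
    return g
```

The first hidden layer tests four half-planes and the second hidden layer takes their conjunction; a single hidden layer followed by an affine output has no such thresholding step available.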
2.2 Definitions: Feedback Nets

By an $n$-dimensional, $m$-input, $p$-output recurrent net we mean a quadruple $(A, B, C, \sigma)$ consisting of three matrices $A \in \mathbb{R}^{n\times n}$, $B \in \mathbb{R}^{n\times m}$, $C \in \mathbb{R}^{p\times n}$ and a diagonal mapping $\sigma$ as in Equation (5). An initialized recurrent net is a 5-tuple

$$\Sigma = (A, B, C, x^0, \sigma) \tag{8}$$

consisting of a recurrent net $(A, B, C, \sigma)$ together with a vector $x^0 \in \mathbb{R}^n$. The (discrete or continuous time) system induced by the net (8) is the set of $n$ coupled (difference or differential, respectively) equations, plus measurement function:

$$x(t+1)\ [\text{or } \dot{x}(t)] = \sigma\bigl(Ax(t) + Bu(t)\bigr), \quad x(0) = x^0, \quad y(t) = Cx(t). \tag{9}$$

The component maps $\sigma_1, \dots, \sigma_n$ of $\sigma$ are the activations of the net. If it is the case that all the $\sigma_i$ are equal to a fixed function $\sigma$, we say that the net is homogeneous with activation $\sigma$ and write also [...] instead of $\sigma$. The spaces $\mathbb{R}^m$, $\mathbb{R}^n$, and $\mathbb{R}^p$ are called respectively the input, state, and output spaces of the net.

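As an illustration, take again the diagram in the right part of Figure 1. This is a pictorial representation of the 2-dimensional, 2-input, single-output recurrent net defined by

$$A = \begin{pmatrix} 2 & 1 \\ 0 & -1 \end{pmatrix}, \quad B = \begin{pmatrix} -1 & 1 \\ 0 & 3 \end{pmatrix}, \quad C = (1 \;\; 0), \quad \sigma = \begin{pmatrix} \sigma_1 \\ \sigma_2 \end{pmatrix}$$

and inducing the system (3).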

In the present context, one interprets the vector equations for $x$ in (9) as representing the evolution of an ensemble of $n$ “neurons,” where each coordinate $x_i$ of $x$ is a real-valued variable which represents the internal state of the $i$th neuron, and each coordinate $u_i$, $i = 1, \dots, m$, of $u$ is an external input signal. The coefficients $A_{ij}$, $B_{ij}$ denote the weights, intensities, or “synaptic strengths,” of the various connections. The coordinates of $y(t)$ represent the output of $p$ probes, or measurement devices, each of which averages the activation values of many neurons. Often $C$ is just a projection on some coordinates, that is, the components of $y$ are simply a subset of the components of $x$. Several motivations for the study of recurrent nets are mentioned in sections 1.3 and 2.4.

The linear systems customarily studied in control theory are precisely the homogeneous recurrent nets with identity activation and $x^0 = 0$. This formal analogy to linear systems, for which a very detailed theory has been developed (see e.g. the textbook [5]), turns out to be very fruitful, as we shall explain later.

One also writes (9) simply as $\Delta x = \sigma(Ax + Bu)$, $x(0) = x^0$, $y = Cx$, where $\Delta x = x^+$ (time-shift) or $\Delta x = \dot{x}$ (time derivative) in discrete or continuous time respectively. Figure 4 gives a “block diagram” of (9).

Figure 4: Recurrent Net

To each initialized recurrent net $(A, B, C, x^0, \sigma)$ we associate a discrete time and a continuous time input/output behavior. Assume given a sequence $u = u(0), \dots, u(k-1)$ of elements of the input space $\mathbb{R}^m$. One may iteratively solve the difference equation (9) starting with $x(0) = x^0$, thereby obtaining a sequence of state vectors $x(1), \dots, x(k)$. In this manner, each initialized recurrent net induces a mapping, on inputs of fixed length $k$,

$$\lambda^k_\Sigma : (\mathbb{R}^m)^k \to \mathbb{R}^p : u \mapsto y(k) = Cx(k) \tag{10}$$

which assigns to the input $u$ the last output produced in response. Sometimes one also introduces a map $\lambda_\Sigma$ into $\mathbb{R}^p$, defined on all possible finite input sequences, by letting the restriction of $\lambda_\Sigma$ to $(\mathbb{R}^m)^k$ be $\lambda^k_\Sigma$.
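In discrete time the map that sends an input sequence to the last output is a short loop. The sketch below assumes tanh activations and reuses, purely for illustration, the matrices $A$, $B$, $C$ that can be read off from system (3); the helper name `io_map` is hypothetical.

```python
import math

def io_map(A, B, C, x0, sigmas):
    """Return the map sending an input sequence u(0..k-1) to y(k) = C x(k)."""
    def step(x, u):
        pre = [sum(a * xj for a, xj in zip(A_row, x)) +
               sum(b * uj for b, uj in zip(B_row, u))
               for A_row, B_row in zip(A, B)]
        return [s(z) for s, z in zip(sigmas, pre)]   # x(t+1) = sigma(A x + B u)
    def lam(inputs):
        x = list(x0)
        for u in inputs:
            x = step(x, u)
        return [sum(c * xj for c, xj in zip(C_row, x)) for C_row in C]
    return lam

# The matrices of the net inducing system (3), run here in discrete time.
A = [[2.0, 1.0], [0.0, -1.0]]
B = [[-1.0, 1.0], [0.0, 3.0]]
C = [[1.0, 0.0]]
lam = io_map(A, B, C, x0=[0.0, 0.0], sigmas=[math.tanh, math.tanh])
```

Calling `lam` on sequences of different lengths gives the restrictions to each fixed length $k$; the empty sequence returns $Cx^0$.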

In continuous time, analogous maps are defined on inputs $u : [0,k] \to \mathbb{R}^m$, through solution of an initial value problem with $x(0) = x^0$. This requires certain technical hypotheses (for instance, Lipschitz continuous activations, and inputs that are measurable and locally integrable) which we do not need to discuss here; see [5] for the precise definitions.

Several variations of the above model can be found in the literature, such as systems of the following form:

$$\dot{x} = -Dx + \sigma(Ax + Bu), \quad y = Cx,$$

where $D$ is a diagonal matrix (for “Hopfield nets” one picks in addition $A$ symmetric) and $A$, $B$, $C$ are as before. Observe that, if the diagonal entries of $-D$ are negative (that is, $D$ has positive entries), and if the activations are bounded, all solutions $x(t)$ of this equation are bounded (because the linear term dominates for large $x$); it is this stability property that makes the variation appealing in applications. In other variants, the input term $Bu$ may be outside the nonlinearity. The paper [14] showed how, at least for certain problems, it is possible to transform among the different models, in such a way that once results are obtained for (9), corollaries for the variants are easily obtained. Thus, here we restrict attention to the simpler form (9), which has the advantage of requiring one less object (no need for the matrix $D$) for its specification.

One may broaden the notion of initialized recurrent net by allowing “biases” or “offsets”, i.e. nonzero vectors $d \in \mathbb{R}^n$ and $e \in \mathbb{R}^p$ in the update and the measurement equations respectively. These equations would then take the more general form $x^+ = \sigma(Ax + Bu + d)$, $y = Cx + e$. Despite the fact that biases are useful, we do not need to include such an extension in the formal definition. This is because the input/output behavior of any such net also arises as the input/output behavior of a net in the sense defined earlier (zero biases), with state space $\mathbb{R}^{n+1}$ and same activations. The simulation is achieved by means of the introduction of an additional variable $z$ whose value is constantly equal to a nonzero number $z_0$ in the range of one of the activations, say $\sigma$, in such a manner that the equations become $x^+ = \sigma(Ax + zd' + Bu)$, $z^+ = \sigma(a_0 z)$, $y = Cx + ze'$, where $a_0$ is chosen so that $\sigma(a_0 z_0) = z_0$ and $d'$, $e'$ are such that $z_0 d' = d$ and $z_0 e' = e$ (if the only activation is $\sigma \equiv 0$, there would be nothing to prove).
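The bias-elimination argument can be checked numerically on a scalar example. The sketch below is illustrative only: it assumes $\sigma = \tanh$, picks $z_0 = \tanh(1)$ so that $a_0 = 1/z_0$ works, and uses arbitrary values for the weights and biases; the function names are hypothetical.

```python
import math

sigma = math.tanh
# Scalar net with biases: x+ = sigma(a*x + b*u + d), y = c*x + e
a, b, c, d, e = 0.5, 1.0, 2.0, 0.3, -0.7

z0 = sigma(1.0)          # a nonzero number in the range of sigma
a0 = 1.0 / z0            # then sigma(a0 * z0) = sigma(1.0) = z0
d_prime, e_prime = d / z0, e / z0

def run_with_bias(inputs, x=0.0):
    """Last output of the net with biases d and e."""
    for u in inputs:
        x = sigma(a * x + b * u + d)
    return c * x + e

def run_without_bias(inputs, x=0.0):
    """Last output of the bias-free net with one extra state coordinate z."""
    z = z0                                   # z is initialized at z0 and stays there
    for u in inputs:
        x, z = sigma(a * x + z * d_prime + b * u), sigma(a0 * z)
    return c * x + z * e_prime
```

Both functions return the same last output for any input sequence, up to rounding. The price paid is one extra state coordinate with nonzero initial value $z_0$.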

2.3  Architectures 

Roughly, by an “architecture” one means a choice of interconnection structure and of the activation functions $\sigma$ for each neuron, leaving weights and initial states unspecified, as parameters. One may also stipulate that the initial state, or just certain specific coordinates of it, should be zero (as with linear systems in control theory). Once the architecture is fixed, feedforward neural networks provide parametric families of functions, alternatives to more traditional parametric sets of functions such as polynomials, splines, rational functions, finite Fourier expansions, or wavelets, and of potential use in areas such as function approximation and interpolation. Similarly, feedback networks with a fixed architecture provide parametric classes of dynamical systems, and as such, universal models for identification and adaptive control. We formalize the notion of architecture by means of incidence matrices, employing binary matrices in order to specify the allowed interconnection patterns and initial states.

By an $(m,n)$-layer architecture we mean a triple

$$(I, J, \sigma) \tag{11}$$

consisting of a matrix $I \in \{0,1\}^{n\times m}$, a vector $J \in \{0,1\}^n$, and a diagonal mapping $\sigma$ as in Equation (5). As before, we call the component maps $\sigma_1, \dots, \sigma_n$ the activations of the layer architecture. Let $k$ be a positive integer. A $k$-hidden layer net architecture is a $(k+1)$-tuple $\mathcal{A} = (\mathcal{L}^1, \dots, \mathcal{L}^{k+1})$ of layer architectures, where $\mathcal{L}^{k+1}$ is homogeneous with identity activation $\sigma(x) = x$. As earlier, we say that the architecture is homogeneous with activation $\sigma$ if all the activations of the layers $\mathcal{L}^i$, $i = 1, \dots, k$, are equal to the fixed function $\sigma$. A net with architecture $\mathcal{A}$ is an instantiation obtained by choosing values for the nonzero entries, that is, any $k$-hidden layer net $(\Sigma^1, \dots, \Sigma^{k+1})$ such that, for every $i = 1, \dots, k+1$, the following property holds: if $\Sigma^i = (A, a, \sigma)$ and $\mathcal{L}^i = (I, J, \sigma')$, then $\sigma = \sigma'$ and the entries of the matrix and vector satisfy $A_{\mu\nu} = 0$ whenever $I_{\mu\nu} = 0$ and $a_\mu = 0$ whenever $J_\mu = 0$.

Let $\mathcal{A} = (\mathcal{L}^1, \dots, \mathcal{L}^{k+1})$, where $\mathcal{L}^i$ is an $(n_{i-1}, n_i)$-layer architecture $(I^i, J^i, \sigma^i)$. We again call the integer $n := n_1 + \dots + n_k$ the number of neurons. Suppose that the binary matrix $I^i$ has exactly $\lambda_i$ nonzero entries and the binary vector $J^i$ has exactly $\mu_i$ nonzero entries. Then we say that the number $r := \lambda_1 + \dots + \lambda_{k+1} + \mu_1 + \dots + \mu_{k+1}$ is the number of parameters or weights of $\mathcal{A}$, and call $\mathbb{R}^r$ the parameter or weight space, and, as before, $\mathbb{R}^m$ and $\mathbb{R}^p$ the input and output spaces of $\mathcal{A}$. Order the indices of the nonzero entries of the $I^i$'s and $J^i$'s in any fixed manner, for instance by listing the nonzero entries of $I^1$ row by row, then those of $J^1$, and so on up to $J^{k+1}$. These indices are in one-to-one correspondence with the coordinates of vectors in $\mathbb{R}^r$. In this manner, one may view the architecture $\mathcal{A}$ as inducing a mapping

$$f_{\mathcal{A}} : \mathbb{R}^r \times \mathbb{R}^m \to \mathbb{R}^p$$

where $f_{\mathcal{A}}(p, \cdot)$ is the mapping induced by the net $(\Sigma^1, \dots, \Sigma^{k+1})$ that would result if we substituted the parameters $p$ into the nonzero entries. We denote by

$$\mathcal{F}_{\mathcal{A}} := \{ f_{\mathcal{A}}(p, \cdot),\ p \in \mathbb{R}^r \} \tag{12}$$

the class of functions $\mathbb{R}^m \to \mathbb{R}^p$ thus associated to the architecture $\mathcal{A}$.

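As an example, the diagram in the left part of Figure 1 is a net with architecture $(\mathcal{L}^1, \mathcal{L}^2, \mathcal{L}^3)$, where $\mathcal{L}^1$ has

$$I = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \quad J = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \quad \sigma = \begin{pmatrix} \sigma_1 \\ \sigma_2 \end{pmatrix}$$

(we used a dotted arrow in Figure 1 to exhibit a bias that is necessarily zero, due to the zero entry in $J$), $\mathcal{L}^2$ has

$$I = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \quad J = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad \sigma = \begin{pmatrix} \sigma_4 \\ \sigma_3 \end{pmatrix},$$

and layer $\mathcal{L}^3$ has $I = (1 \;\; 1)$ and $J = 0$. The mapping induced by this architecture is $f_{\mathcal{A}} : \mathbb{R}^{12} \times \mathbb{R}^2 \to \mathbb{R}$, where $f_{\mathcal{A}}(p,u)$ equals:

$$p_{11}\sigma_4\bigl(p_6\sigma_1(p_1u_1 + p_2u_2) + p_7\sigma_2(p_3u_1 + p_4u_2 + p_5) + p_9\bigr) + p_{12}\sigma_3\bigl(p_8\sigma_2(p_3u_1 + p_4u_2 + p_5) + p_{10}\bigr)$$

and $\mathcal{F}_{\mathcal{A}}$ is the set of functions $f_{\mathcal{A}}(p, \cdot)$ obtained for each fixed choice of the parameter vector $p$.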
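The passage from binary masks to a parameterized mapping can be sketched as follows. This is an illustration under assumptions: masks are nested lists of 0/1, every hidden activation is taken to be tanh, the helper names are hypothetical, and the parameter ordering is the one suggested in the text (the matrix mask of each layer row by row, then its vector mask, layer after layer).

```python
import math

def count_parameters(arch):
    """Number of nonzero mask entries over all layers."""
    return sum(sum(map(sum, I)) + sum(J) for I, J, _ in arch)

def instantiate(arch, p):
    """Substitute p into the nonzero mask entries, layer by layer."""
    it = iter(p)
    layers = []
    for I, J, sigmas in arch:
        A = [[next(it) if bit else 0.0 for bit in row] for row in I]
        a = [next(it) if bit else 0.0 for bit in J]
        layers.append((A, a, sigmas))
    return layers

def f_arch(arch, p, u):
    """Evaluate the mapping induced by the architecture at parameters p, input u."""
    x = list(u)
    for A, a, sigmas in instantiate(arch, p):
        x = [s(sum(w * xj for w, xj in zip(row, x)) + a_i)
             for row, a_i, s in zip(A, a, sigmas)]
    return x

t, ident = math.tanh, (lambda z: z)
# Masks of a 2-hidden layer architecture with 2 inputs and 1 output.
arch = [([[1, 1], [1, 1]], [0, 1], [t, t]),
        ([[1, 1], [0, 1]], [1, 1], [t, t]),
        ([[1, 1]],         [0],    [ident])]
```

With these masks `count_parameters(arch)` gives 12, and `f_arch(arch, p, u)` evaluates the induced map for any parameter vector `p` of that length.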

We next define similar notions for the feedback case. By an $n$-dimensional, $m$-input, $p$-output recurrent architecture we mean a 5-tuple

$$\mathcal{A} = (\alpha, \beta, \gamma, \xi, \sigma) \tag{13}$$

consisting of three matrices $\alpha \in \{0,1\}^{n\times n}$, $\beta \in \{0,1\}^{n\times m}$, and $\gamma \in \{0,1\}^{p\times n}$, a vector $\xi \in \{0,1\}^n$, and a diagonal mapping $\sigma$ as in Equation (5).

An initialized recurrent net with architecture $\mathcal{A}$ is an instantiation obtained by choosing values for the nonzero entries, that is, any initialized recurrent net $(A, B, C, x^0, \sigma')$ such that $\sigma = \sigma'$ and the entries of the matrices and vector satisfy $A_{ij} = 0$ whenever $\alpha_{ij} = 0$, $B_{ij} = 0$ whenever $\beta_{ij} = 0$, $C_{ij} = 0$ whenever $\gamma_{ij} = 0$, and $x^0_i = 0$ whenever $\xi_i = 0$.

We say also here that the component maps $\sigma_1, \dots, \sigma_n$ of $\sigma$ are the activations of the net, which is homogeneous with activation $\sigma$ if all $\sigma_i$ are equal to a fixed function $\sigma$. The spaces $\mathbb{R}^m$, $\mathbb{R}^n$, and $\mathbb{R}^p$ are respectively the input, state, and output spaces of the architecture. Suppose that the binary matrices $\alpha$, $\beta$, and $\gamma$ and the vector $\xi$ have exactly $\kappa$, $\lambda$, $\mu$, and $\nu$ nonzero entries respectively; then we call the number $r := \kappa + \lambda + \mu + \nu$ the number of parameters or weights of $\mathcal{A}$, and call $\mathbb{R}^r$ the parameter or weight space. Arrange the indices of the nonzero entries in any fixed manner, for instance by listing their nonzero entries row by row, for $\alpha$, $\beta$, $\gamma$, and $\xi$ in that order. These indices are in one-to-one correspondence with the coordinates of vectors in $\mathbb{R}^r$. In this manner, one may view the architecture $\mathcal{A}$ as representing a parameterized system (in continuous or discrete time, with $\Delta x$ standing for $\dot{x}$ or $x^+$) $\Delta x = \sigma(\alpha x + \beta u)$, $x(0) = \xi$, $y = \gamma x$ where, by substituting the parameters $p \in \mathbb{R}^r$ into the nonzero entries of $(\alpha, \beta, \gamma, \xi)$, every possible initialized recurrent net $\Sigma = \mathcal{A}(p)$ with architecture $\mathcal{A}$ results.

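For example, the diagram in the right part of Figure 1 is a recurrent net with architecture $\mathcal{A}$, where

$$\alpha = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \quad \beta = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \quad \gamma = (1 \;\; 0), \quad \xi = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \sigma = \begin{pmatrix} \sigma_1 \\ \sigma_2 \end{pmatrix},$$

and the (continuous-time) corresponding parameterized system (with parameter space $\mathbb{R}^7$) is

$$\dot{x}_1(t) = \sigma_1\bigl(p_1x_1(t) + p_2x_2(t) + p_4u_1(t) + p_5u_2(t)\bigr) \tag{14a}$$
$$\dot{x}_2(t) = \sigma_2\bigl(p_3x_2(t) + p_6u_2(t)\bigr) \tag{14b}$$
$$y(t) = p_7x_1(t). \tag{14c}$$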

Recalling the notations in Equation (10), for each recurrent architecture $\mathcal{A}$ and each $k > 0$, we may introduce the set

$$\{ \lambda^k_\Sigma \mid \Sigma = \mathcal{A}(p),\ p \in \mathbb{R}^r \} \tag{15}$$

of mappings $(\mathbb{R}^m)^k \to \mathbb{R}^p$. Elements of this set are the input/output mappings induced on inputs of length $k$ by each possible initialized recurrent net with architecture $\mathcal{A}$.
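A recurrent architecture can be instantiated in the same spirit. The following sketch makes several assumptions for illustration: masks consistent with the parameterized system (14) (seven free parameters, initial state fixed at zero), tanh activations, discrete time, and hypothetical helper names.

```python
import math

def instantiate_recurrent(alpha, beta, gamma, xi, p):
    """Fill the nonzero entries of (alpha, beta, gamma, xi) with p, row by row."""
    it = iter(p)
    fill = lambda M: [[next(it) if bit else 0.0 for bit in row] for row in M]
    A, B, C = fill(alpha), fill(beta), fill(gamma)
    x0 = [next(it) if bit else 0.0 for bit in xi]
    return A, B, C, x0

def last_output(A, B, C, x0, sigmas, inputs):
    """Discrete-time response y(k) = C x(k) to the input sequence u(0..k-1)."""
    x = list(x0)
    for u in inputs:
        x = [s(sum(a * xj for a, xj in zip(Ar, x)) + sum(b * uj for b, uj in zip(Br, u)))
             for Ar, Br, s in zip(A, B, sigmas)]
    return [sum(c * xj for c, xj in zip(Cr, x)) for Cr in C]

# Masks matching the parameterized system (14): seven free parameters.
alpha = [[1, 1], [0, 1]]
beta  = [[1, 1], [0, 1]]
gamma = [[1, 0]]
xi    = [0, 0]
p = [2.0, 1.0, -1.0, -1.0, 1.0, 3.0, 1.0]   # recovers the net inducing system (3)
A, B, C, x0 = instantiate_recurrent(alpha, beta, gamma, xi, p)
```

Letting `p` range over $\mathbb{R}^7$ and fixing the input length $k$ sweeps out the set of mappings in (15) for this architecture.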
2.4 A General Framework for Learning

In this section, we briefly review a formulation of the problem of learning. Our purpose in doing so is to motivate, and to set the stage for a description of, some of our main lines of research. We chose to base the discussion on the paradigm currently popular in computational learning theory (COLT), a formulation which also appears, in a slightly different form, in related statistics areas of parametric and nonparametric estimation. (The COLT area, considered part of theoretical computer science, is one of considerable activity at present, as evidenced by the profusion of papers and conferences on the subject.) The language to be introduced affords a way to systematically explore the capabilities of neural networks in the context of a precise mathematical formalism, but alternative formulations of the same fundamental questions are also possible, based on other methods from statistics and numerical analysis.

The questions that we deal with can informally be described as concerning the choice of suitably “good” responses (outputs) to given stimuli (inputs). In applications, this may translate into tasks as diverse as: in state feedback control problems, finding a control that produces a desired action as a function of the current state; in OCR algorithms, guessing the correct character given a digitized image of a letter; or in market analysis programs, deciding whether to buy or sell a stock based on a time-window of price data. “Learning” comes into the picture if we assume that, as part of a preliminary “training” phase, we were supplied with a representative sample of such “good” responses, and our goal is to form associations which allow us to produce good responses for future inputs, including of course inputs which were not seen exactly during the training period (we haven’t seen all possible apples and oranges, yet can reliably distinguish between them). It is this last quality of generalization (interpolation/extrapolation in an abstract sense) that makes the problem interesting: otherwise we could in principle simply store all training data, and use efficient database retrieval methods for finding appropriate responses. One could say that, in some implicit sense, the goal of a learning system is to extract decision rules from training data, although the formalization to be discussed next does not make rule-extraction explicit.

Of course, the problem of generalization is not well-posed unless we make some prior assumptions about what the “good” input/output pairs are. In numerical analysis, such prior assumptions take the form of smoothness constraints; in classical statistics one postulates e.g. normality of distributions. The general approach in learning theory is to postulate that there is a probability distribution $P$ on input/output pairs $(u, y)$ which generates the “good” pairs, in the sense that the desirable responses for a given $u$ are those $y$ for which the probability of $y$ given the observed $u$ is in some sense maximized. The distribution $P$ may be unknown, but it will be assumed to be a member of a fixed class of distributions $\mathcal{P}$ (e.g., normal distributions with different means and standard deviations, uniform distributions with different supports, or even the set of all probability measures with respect to a fixed $\sigma$-algebra). The most important assumption is one of stationarity: both the training samples and the data seen in the future (testing data) are assumed to be randomly drawn according to the same $P$.

This general idea includes many cases of interest. A most important special case is that of function learning, in which the “good” pairs are those that belong to the graph of an unknown “target” (or “oracle,” or “teacher”) function $h$, whose behavior the learner attempts to emulate. We assume that inputs $u$ are generated randomly according to a probability distribution $P_0$, and for each input so generated, the target value $y = h(u)$ is provided to the learning system. (At this level of abstraction, function learning includes systems identification, in which case $h$ would represent a plant to be identified from its responses $h(u)$ to, for instance, white noise inputs $u$.) This fits in the above setting roughly as follows. There is a natural probability $P_h$ induced on pairs $(u, y)$: a pair $(u, y)$ with $y = h(u)$ has the same probability as $u$ has under $P_0$, and any other pair has zero probability. Estimating $P_h$ amounts to learning $h$. A related example is that in which the learner only has access to noisy measurements of $h(u)$; in that case, one may in an obvious manner induce a probability on pairs $(u, y)$ which incorporates the effects of noise. (Of course, we are being extremely informal right here, for instance because what we just said is not correct for continuous domains, where one must argue in terms of probabilities with respect to suitable $\sigma$-algebras; the purpose is to explain intuitively the basic definitions. Much more discussion can be found in standard texts in machine learning and learning theory, including explanations of relations to Bayesian analysis and classical statistical techniques.)

In evaluating performance of a learning system, one has to allow for the facts that (1) accidentally, training data may not have been rich enough, as a sample, to identify the unknown distribution, which means that our confidence in future predictive behavior can never be perfect, and (2) even if the training data was indeed representative, it may well be the case that a new, unseen input to be responded to is itself an “outlier” with respect to typical inputs, so our prediction may not be totally accurate. These two potential sources of uncertainty give rise to the notion of “probably approximately correct” learning, in which the goal is to be able to provide, with high confidence, a reasonably accurate response to future inputs. (Two numbers in the range $[0,1]$, typically denoted by $\varepsilon$ and $\delta$, appear in all formulations, to denote respectively these accuracy and confidence levels.)

2.4.1  PAC  Learning 

We now review briefly the basic “agnostic PAC learning” formalism. Two sets $\mathcal{U}$ and $\mathcal{Y}$ are given, the sets of input and output values respectively. Although the theory can be derived in more generality, we will assume here that $\mathcal{U}$ is a complete separable metric space (typically, in any case, $\mathcal{U}$ is a closed subset of a Euclidean space) and that $\mathcal{Y}$ is a compact subset of $\mathbb{R}$ (far more general outputs could equally well be considered, but notations become slightly more involved). Also given is a family $\mathcal{P}$ of Borel probability measures on $\mathcal{U} \times \mathcal{Y}$ and a family $\mathcal{F}$ of Borel measurable maps $f : \mathcal{U} \to \mathcal{Y}$.

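By an identifier we mean a map $\varphi : \Omega \to \mathcal{F}$, where we are denoting by $\Omega$ the set of all finite sequences $\omega = (u_1, y_1), \dots, (u_s, y_s)$ of elements of $\mathcal{U} \times \mathcal{Y}$ (with varying lengths $s$). We write the value of $\varphi$ on a sequence $\omega$ as $\varphi_\omega$; thus $\varphi_\omega$ is itself a function $\mathcal{U} \to \mathcal{Y}$.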

As discussed in previous paragraphs, the role of a $P \in \mathcal{P}$ is to represent the distribution of input/output data to be learned, while an element $f \in \mathcal{F}$ describes the function used by the learner to fit the data. In our discussions, $\mathcal{F}$ will usually be the class $\mathcal{F}_{\mathcal{A}}$ (cf. Equation (12)) consisting of those mappings $f(p, \cdot)$ associated to the possible nets with a given architecture $\mathcal{A}$. The elements of $\mathcal{F}$ are often called hypotheses. Thus an identifier is a method of assigning a candidate hypothesis $f \in \mathcal{F}$ based on the training data $\omega$ that has been observed in the past (we prefer not to use the term “algorithm” for $\varphi$, since this term connotes computability, and we wish to separate questions of computability from information-theoretic concepts). We next quantify the performance of identifiers, assuming that the training and test data are drawn according to a distribution $P \in \mathcal{P}$.

Assume that a probability $P \in \mathcal{P}$ has been chosen. The error of the identifier $\varphi$ with respect to the probability measure $P$ and the (training) sequence $\omega \in \Omega$ is defined as the expectation (with respect to $P$) of the squared error (“risk”), on test data, of the function produced by the identifier:

$$\mathrm{Err}(P, \varphi, \omega) := \int_{\mathcal{U} \times \mathcal{Y}} \bigl(y - \varphi_\omega(u)\bigr)^2 \, dP(u, y).$$

(We use squared error for simplicity of exposition, but more arbitrary loss criteria could be employed as a definition.) In a special case of wide interest, that of binary classification problems, the output value set $\mathcal{Y}$ consists just of two elements $\{0, 1\}$; in that case, the expression $\mathrm{Err}(P, \varphi, \omega)$ simply represents the probability of misclassification, that is, of the event $y \neq \varphi_\omega(u)$.
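For a finitely supported distribution the error integral reduces to a weighted sum, which makes the binary-classification remark easy to check. The sketch below uses a made-up distribution and a fixed threshold rule standing in for the function produced by an identifier; all names and numbers are hypothetical.

```python
def expected_squared_error(P, predict):
    """Err for a finitely supported P: sum of prob * (y - predict(u))**2."""
    return sum(prob * (y - predict(u)) ** 2 for (u, y), prob in P.items())

def misclassification_probability(P, predict):
    """Total probability of the pairs (u, y) with predict(u) != y."""
    return sum(prob for (u, y), prob in P.items() if predict(u) != y)

# Toy distribution on {0, 1, 2} x {0, 1}; the probabilities are made up.
P = {(0, 0): 0.3, (0, 1): 0.1, (1, 1): 0.25, (2, 0): 0.05, (2, 1): 0.3}

# A fixed binary hypothesis standing in for the identifier's output.
threshold = lambda u: 1 if u >= 1 else 0
```

Since both $y$ and the prediction take values in $\{0, 1\}$, the squared error is 1 exactly on misclassified pairs, so the two quantities coincide.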

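The best achievable performance (lowest possible error) among all $f \in \mathcal{F}$, if the test samples are distributed according to $P$, is:

$$\mathrm{Err}_{\mathcal{F}}(P) := \inf_{f \in \mathcal{F}} \int_{\mathcal{U} \times \mathcal{Y}} \bigl(y - f(u)\bigr)^2 \, dP(u, y).$$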

Since $\varphi_\omega \in \mathcal{F}$, $\mathrm{Err}_{\mathcal{F}}(P) \le \mathrm{Err}(P, \varphi, \omega)$ for all identifiers $\varphi$ and all training data $\omega$. A measure of merit for any given $\varphi$ can be formulated in terms of the gap between these numbers: we ask that, for most training instances, the gap be small. More precisely, an identifier $\varphi$ is said to be probably approximately correct (PAC) with respect to the family of probabilities $\mathcal{P}$ and the hypothesis class $\mathcal{F}$, if for each $\varepsilon > 0$ and each $\delta > 0$ there is some integer $s = s(\varepsilon, \delta)$ such that, for every $P \in \mathcal{P}$,

$$P^s\bigl\{ \mathrm{Err}(P, \varphi, \omega) < \mathrm{Err}_{\mathcal{F}}(P) + \varepsilon \bigr\} > 1 - \delta. \tag{16}$$

Note that the probability on training samples $\omega$ is being understood with respect to the induced measure $P^s$ on the product space $(\mathcal{U} \times \mathcal{Y})^s$. In other words, one is asking that the event “$|\mathrm{Err}(P, \varphi, \omega) - \mathrm{Err}_{\mathcal{F}}(P)| < \varepsilon$” becomes almost sure as $s \to \infty$, uniformly on $P \in \mathcal{P}$. If $\varphi$ is PAC, we may define a function $s : \mathbb{R}_{>0} \times \mathbb{R}_{>0} \to \mathbb{Z}$ by letting $s(\varepsilon, \delta)$ be the smallest integer for which equation (16) holds for all $P \in \mathcal{P}$; this function is often called the sample complexity of $\varphi$.

If a PAC identifier exists, we say that $(\mathcal{P}, \mathcal{F})$ is PAC (or uniformly) learnable. The most desirable special case is that in which $\mathcal{P}$ is the family of all (Borel) probability measures; if $(\mathcal{P}, \mathcal{F})$ is PAC learnable with this $\mathcal{P}$, one says simply that $\mathcal{F}$ itself is PAC learnable.

From the work of Vapnik, Pollard, Haussler, and others, there follows a simple yet powerful
sufficient combinatorial condition for PAC learnability of $\mathcal{F}$: if $\mathcal{F}$ has finite
pseudo-dimension, then it is PAC learnable. (We review in Section 3.2 the meaning of
pseudo-dimension, and in particular the notion of Vapnik-Chervonenkis dimension. They characterize
the richness of $\mathcal{F}$ as a class of functions, expressed in terms of discrimination power.)
Furthermore, in that case it is possible, at least in principle, to design an identifier using the
most "naive" approach of fitting an $f \in \mathcal{F}$ as well as possible to the observed data, and
then using this $f$ for prediction. We make this precise next.

Given a sample $\nu = (u_1,y_1),\ldots,(u_s,y_s)$, for each possible function $f \in \mathcal{F}$ we
may consider the quantity

\[ \mathrm{Emp}(f,\nu) := \frac{1}{s}\sum_{i=1}^{s} \bigl(y_i - f(u_i)\bigr)^2 \]

which represents the "fitting error" or "empirical risk" that $f$ makes on the given sample.
Consider also, for this same sample, the smallest possible error achievable using the given
hypotheses class:

\[ \mathrm{Emp}(\nu) := \inf_{f\in\mathcal F} \mathrm{Emp}(f,\nu). \qquad (17) \]

Note that the calculation of $\mathrm{Emp}(\nu)$, for observed training data $\nu$, as well as the
finding of an element $f \in \mathcal{F}$ which provides a near optimal value (i.e., so that
$\mathrm{Emp}(f,\nu) - \mathrm{Emp}(\nu)$ is small), constitutes an optimization problem over
$\mathcal{F}$. When $\mathcal{F}$ is specified in terms of a finite number of parameters (such as the
adjustable weights in a neural network), this becomes a problem of finite-dimensional optimization.
With these notations we can state the by-now classical results:

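If the pseudo-dimension $d$ of $\mathcal{F}$ is finite, then:

1. there is an identifier with sample complexity

\[ s(\varepsilon,\delta) \le \frac{c}{\varepsilon^2}\Bigl(d\ln\frac{1}{\varepsilon} + \ln\frac{1}{\delta}\Bigr), \]

where $c$ is a constant* that depends on $Y$, and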

2. such an identifier can be obtained by empirical risk minimization: to achieve the error
estimate in Equation (16), for given $\delta$, $\varepsilon$, and given the sample
$\nu = (u_1,y_1),\ldots,(u_s,y_s)$ with length $s \ge s(\varepsilon,\delta)$, it is enough to find any
function $f$ with the following property:

Emp  (/,  I/)  <  Emp  (t^)  +  |  •  (18) 

and to define $\varphi_\nu$ to be this function $f$.
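The two statements above can be illustrated on a toy problem. This is a minimal sketch under stated assumptions, not the construction of the cited works: the hypothesis class (threshold functions on a finite grid), the helper names, the `slack` parameter and the default constant `c = 1.0` are all hypothetical, and `sample_bound` assumes the bound has the shape $(c/\varepsilon^2)(d\ln(1/\varepsilon)+\ln(1/\delta))$ with the true constant left unspecified.

```python
import math

def emp(f, sample):
    """Empirical risk Emp(f, nu): mean squared error of f on the sample."""
    return sum((y - f(u)) ** 2 for u, y in sample) / len(sample)

def make_threshold(theta):
    """Hypothetical hypothesis f_theta(u) = 1 if u >= theta, else 0."""
    return lambda u: 1 if u >= theta else 0

def erm(sample, thetas, slack=0.0):
    """Return (theta, risk) for the first grid hypothesis whose empirical risk
    is within `slack` of the smallest empirical risk over the grid."""
    risks = [(emp(make_threshold(t), sample), t) for t in thetas]
    best = min(r for r, _ in risks)          # plays the role of Emp(nu)
    for r, t in risks:
        if r <= best + slack:
            return t, r

def sample_bound(eps, delta, d, c=1.0):
    """Assumed shape (c / eps^2) * (d ln(1/eps) + ln(1/delta)); c = 1.0 is a placeholder."""
    return math.ceil(c / eps ** 2 * (d * math.log(1 / eps) + math.log(1 / delta)))

sample = [(0.1, 0), (0.2, 0), (0.35, 0), (0.6, 1), (0.7, 1), (0.9, 1)]
thetas = [0.0, 0.25, 0.5, 0.75, 1.0]
theta_hat, risk_hat = erm(sample, thetas)    # theta = 0.5 separates this sample exactly
```

With a positive `slack`, any near-minimizer is accepted, which mirrors the point that exact minimization is not required; halving `eps` in `sample_bound` roughly quadruples the required sample length.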

(Even more interesting, this almost-minimization may be performed by a probabilistic algorithm,
in that it is enough that the almost-optimality estimate in (18) hold with high probability.)
Through empirical risk minimization, the theory of PAC learning makes contact with, and
provides an elegant generalization of, classical questions regarding the uniform convergence of
empirical means of random variables and, more generally, probabilities of sets, and in particular
the Glivenko-Cantelli Lemma.
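As a small numerical illustration of the uniform convergence expressed by the Glivenko-Cantelli Lemma, the sketch below computes the largest gap between the empirical distribution function of $n$ uniform draws and the true distribution function $t \mapsto t$ on $[0,1]$. The choice of the uniform distribution and the function name `sup_deviation` are assumptions made for this example only.

```python
import random

def sup_deviation(n, seed=0):
    """sup over t in [0, 1] of |empirical CDF(t) - t| for n i.i.d. Uniform[0, 1] draws."""
    rng = random.Random(seed)
    xs = sorted(rng.random() for _ in range(n))
    # The supremum is attained at a sample point, just after or just before the jump there.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

for n in (10, 100, 1_000, 10_000):
    print(n, sup_deviation(n))
```

The printed gaps typically shrink at a rate of order $1/\sqrt{n}$, although they need not decrease monotonically for a particular seed.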

It  should  be  emphasized  that  the  finiteness  of  pseudo-dimension  is  merely  a  sufficient,  not 
a  necessary,  condition,  and  that  much  research  nowadays  concerns  the  development  of  weaker 
conditions for learnability (viz. the area of "fat shattering"). However, as we shall see, it is a
very  useful  condition,  and,  moreover,  in  the  special  case  of  binary  valued  outputs,  which  is  the 
case  of  interest  in  pattern  recognition  and  classification,  the  condition  turns  out  to  be  necessary 
as  well. 

2.4.2  Subproblems 

The above discussion leads to the following general questions, for any given $\mathcal{F}$ and
$\mathcal{P}$:

Q1 How large can the errors $\mathrm{Err}_{\mathcal F}(P)$ be?

Q2 Does $\mathcal{F}$ have finite pseudo-dimension?

Q3 Is it computationally feasible to find $\mathrm{Emp}(\nu)$?

Q4 What are the properties of the error function being minimized in (17)?

The first question characterizes the minimum potentially achievable error, no matter how
many samples are seen or how powerful an algorithm is used: the smaller
$\mathrm{Err}_{\mathcal F}(P)$, the more useful is the estimate in Equation (16). It leads to function
approximation (of densities, if the $P$'s are described in such terms) by elements of $\mathcal{F}$.
The line of work in the PI papers [44] and [21] was motivated in this manner; see Section 3.3.1 of
the report. The second question, whose answer helps determine if it is possible to learn at all
(using finite samples), is more combinatorial in nature, and in particular the study of lower bounds
relies on finding explicit constructions to implement a large variety of functions. In this
direction, relevant PI work is represented by [8], [11], [39], [53], [25], and [26], as well as
related results for recurrent nets to be mentioned later; see Sections 3.2.1 and 3.2.3 of the
report. The third question is of a computational complexity nature (NP-completeness), and its
study, at least for certain activations, requires simple techniques from combinatorics and discrete
mathematics. Relevant PI work is exemplified by [48] and [20]; see Section 3.5 of the report.
Finally, the last question, whose answer impacts the performance of numerical algorithms based on
gradient descent (so-called "backpropagation"), leads to issues of Morse theory, stratifications of
subanalytic sets, model theory in logic, as well as elementary analysis. Among PI work in this
direction are the references [7], [9], [50], [22]; see Section 3.1 of the report.

* A far more explicit estimate is known, but we absorb many constants into $c$ in order to exhibit
the dependence on $d$, $\varepsilon$, $\delta$ as simply as possible; we also must assume that
$\varepsilon$ is small enough, e.g. $\varepsilon < 1/2$.

Section  3  of  the  report  discusses  some  of  these  issues  in  more  detail,  though  space  constraints 
do  not  allow  a  technical  development  and  the  references  should  be  consulted. 

2.4.3  Learning  and  Recurrent  Nets 

As explained in the introduction to this report, in many applications involving pattern
classification or learning, input data arrives naturally as a time series. Inputs fed into a speech
recognition system are often sequences consisting of windowed Fourier coefficients, and in control
problems inputs to a regulator may be sequences of measurements of the plant being controlled as
well as the successive coordinates of a path to be tracked. Under these circumstances, a learning
system should exploit the information inherent in the correlations and dependencies that exist among
the terms of the input sequence. One way to take into account this additional structure is through
the use of hypotheses classes which consist of dynamical systems. Kalman filtering, which relies
on linear dynamical systems for extracting information (filtering of noise) from a stream of data,
is perhaps the most successful known example of an application of this principle.

In the notations and terminology established earlier, and for concreteness dealing with the
discrete-time case, the inputs $u$, elements of the set $U$, are now finite sequences (which,
depending on the problem being studied, could be of a fixed length, or of varying lengths). We
assume that the output space is $\mathbb{R}$. The hypotheses classes $\mathcal{F}$ consist of systems
which start from a fixed initial state, evolve according to internal dynamics forced by the external
input sequence $u = u_1, \ldots, u_k$, and produce real-number outputs $y_t$, $t = 1, \ldots, k$, as a
result. In a learning context, we will always view the last output, produced after presentation of
the complete sequence $u$, as the output value $f(u)$ assigned by the hypothesis
$f \in \mathcal{F}$ to the input sequence $u$.


[Figure: the input sequence $u = u_1 \ldots u_k$ is fed into a dynamical system.]

Intuitively, by limiting the memory and power (dynamic order, number of adjustable parameters)
of the elements of $\mathcal{F}$, and analyzing behavior for longer and longer input sequences, one
is able to focus on the properties that truly reflect the dependence of $f(u)$ on long-term time
correlations in the data. Although this paradigm is inspired by the use of finite automata for the
recognition of languages, or the use of recursive least squares techniques in statistical problems,
the interest here is in nonlinear, continuous-state, dynamical systems. In particular, we employ
(scalar-output) recurrent networks and study classes such as those introduced in Equation (15),
which consist of those mappings $(\mathbb{R}^m)^k \to \mathbb{R}$ induced on inputs of length $k$ by
each possible initialized recurrent net with architecture $A$.
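To make the notion of a hypothesis induced by an initialized recurrent net concrete, here is a minimal discrete-time sketch. It assumes the update $x(t+1) = \sigma(Ax(t) + Bu(t))$ with read-out $y(t) = Cx(t)$, a zero initial state by default, and $\tanh$ as activation; these conventions, and the function names, are assumptions for illustration and may differ in detail from Equation (15).

```python
import math

def mat_vec(M, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def recurrent_net_output(A, B, C, inputs, x0=None, sigma=math.tanh):
    """Value f(u) assigned to the input sequence u = inputs: run
    x(t+1) = sigma(A x(t) + B u(t)) from the initial state and return
    the scalar read-out C x after the whole sequence has been presented."""
    n = len(A)
    x = list(x0) if x0 is not None else [0.0] * n
    for u in inputs:                                   # each u is a list of length m
        pre = [a + b for a, b in zip(mat_vec(A, x), mat_vec(B, u))]
        x = [sigma(p) for p in pre]
    return sum(c * xi for c, xi in zip(C, x))          # C is a single row (scalar output)

# A two-neuron net with scalar inputs, applied to sequences of different lengths.
A = [[0.5, -0.3], [0.2, 0.1]]
B = [[1.0], [0.5]]
C = [1.0, -1.0]
short = recurrent_net_output(A, B, C, [[0.2], [0.4]])
longer = recurrent_net_output(A, B, C, [[0.2], [0.4], [-0.1], [0.3]])
```

The same triple of weights defines one mapping for every input length $k$, which is why the number of parameters stays fixed while the sequences grow.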

The  questions  Q1-Q4  posed  above  for  static  nets  can  also  be  asked,  when  suitably  interpreted, 
for dynamic networks. Question Q1 leads to the topic of approximation of dynamical systems by
recurrent  nets.  Work  described  in  the  PI  papers  [37]  and  [32]  deals  with  this  issue,  mentioned 
in  Section  3.4  of  the  report.  Question  Q2  has  been  researched  in  the  recent  PI  papers  [52],  [24], 
and  [27],  while  Q3  is  touched  upon  in  the  first  two  of  these  references,  and  both  are  mentioned 
in  Section  3.2.2. 

2.5  Feedback  Control  Questions 

Not  all  of  our  research  in  the  neural  network  area  deals  with  questions  motivated  by  learning 
and generalization. A very different line of work of ours, which has achieved many important
results  yet  needs  further  development,  concerns  the  exploration  of  the  capabilities  of  networks  as 
controllers  for  nonlinear  systems. 

2.5.1  State-Feedback  Control 

Nonlinear control is often mentioned as one of the most promising areas of application for neural
networks. We start with the standard paradigm: a system of controlled differential equations

\[ \dot{x} = f(x,u), \qquad (19) \]

evolving in a manifold, where the right-hand side is interpreted as a set of vector fields
parameterized by the inputs (see e.g. [5] for precise definitions). Assume that $x_0$ is an
equilibrium state ($f(x_0,u_0) = 0$ for some control value $u_0$); then one of the most basic control
questions regards the existence of a feedback mapping $u = k(x)$, with $k(x_0) = u_0$, which renders
the equilibrium $x_0$ of the closed-loop system $\dot{x} = f(x,k(x))$ globally asymptotically stable.
Most of the neurocontrol literature attempts to find such a feedback in the form of a neural
network. (Of course, more complex control objectives are often posed, such as tracking with
internal stability or model reference adaptive control, but we use stabilization, which is always a
first design objective, to understand the problems in the "cleanest" possible setting.)
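The closed-loop construction can be sketched numerically. The example below is hypothetical and deliberately trivial: a scalar system $\dot{x} = x + u$ with equilibrium $x_0 = 0$, $u_0 = 0$, a linear feedback $k(x) = -2x$, and forward-Euler integration. A single simulated trajectory does not prove global asymptotic stability; it only illustrates how a feedback law is substituted into Equation (19).

```python
def simulate_closed_loop(f, k, x0, dt=1e-3, steps=10_000):
    """Forward-Euler integration of the scalar closed loop x' = f(x, k(x))."""
    x = x0
    for _ in range(steps):
        x += dt * f(x, k(x))
    return x

def f(x, u):
    """Hypothetical open-loop dynamics x' = x + u (unstable when u = 0)."""
    return x + u

def k(x):
    """Feedback with k(0) = 0; the closed loop becomes x' = -x."""
    return -2.0 * x

x_closed = simulate_closed_loop(f, k, x0=3.0)               # decays toward the equilibrium 0
x_open = simulate_closed_loop(f, lambda x: 0.0, x0=3.0)     # grows without the feedback
```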

Very often, the stated objective is to obtain feedback laws implementable by nets of a special
form, namely 1HL nets (recall Figure 3 and Equation (7)). This simple structure, with clearly
identifiable and tunable parameters, is appealing from an adaptive control point of view. Moreover,
the universality results for function approximation would seem to indicate that such networks
are always sufficient. However, this is very misleading. It turns out that 1HL nets are not
"universal" enough for the implementation of controllers, in a precise sense reviewed later. This
was first pointed out in the paper [10], which also gave a general theorem showing that two
hidden layers (and discontinuous activations) are sufficient. (The contribution was awarded a
citation as an outstanding paper in the IEEE Transactions on Neural Networks.) Intuitively, the
reason for this apparent contradiction is that control problems are essentially inverse problems
(one must solve for a trajectory satisfying certain boundary conditions). For such problems, as
pointed out in subsection 2.1.1, more general types of networks are needed.

For an intuitive understanding of how this obstruction might occur, consider the following
situation. Suppose that the objective is to globally asymptotically stabilize a planar system with
respect to the origin using controllers $u = k(x)$ implementable by 1HL nets. It is easy to give
examples for which there is no continuous feedback law that will achieve stabilization, even if
the original system is very simple. Since its response is continuous, no 1HL net with continuous
activations would be able to accomplish the stated objective. But what about allowing
discontinuous $\sigma$ such as Heaviside activations? (We ignore for now questions of possible
nonexistence of solutions for ODEs with discontinuous right-hand sides; the problem will be even
more basic.) It turns out that even then it may be impossible to stabilize. Indeed, assume that we
know some discontinuous feedback law $k_0(x)$ which stabilizes. It would appear that one can then
obtain $k(x)$ simply by approximating $k_0$. However (as mentioned in subsection 2.1.1 and citing
[10]), it is impossible in general to approximate a discontinuous $k_0$ uniformly by 1HL functions.
But a weak type of approximation may not be enough for control purposes. For instance, it may be
the case that for each approximant $k$ of $k_0$ there is a small region $\Gamma$ encircling the
origin where the approximation is bad, in the sense that the vector field $f(x,k(x))$ must point
transversally outward everywhere on $\Gamma$, thus introducing an obstacle or "barrier" to global
stabilization.

A precise result is given in the paper [10] for discrete-time systems (applicable to
continuous-time via sample-and-hold control). That paper exhibits explicit examples of systems which
are otherwise stabilizable but such that every possible feedback implementable by a 1HL net would
fail, as any closed-loop system obtained in that way must give rise to a nontrivial periodic orbit.
On the other hand, it also shows that if a system is stabilizable in any manner whatsoever, then
it can also be stabilized using two-hidden-layer nets.

To summarize, if stabilization requires discontinuities in feedback laws, it may be the case
that no possible 1HL net stabilizes. Thus the issue of stabilization by nets is closely related to
the standard problem of continuous and smooth stabilization of nonlinear systems, one that has
attracted much research attention in recent years. Roughly, there is a hierarchy of state-feedback
stabilization problems: those that admit continuous solutions, those that do not but can still be
solved using 1HL nets with discontinuous activations, and more general ones (solvable with two
hidden layers). It can be expected that an analogous situation will be true for other control
problems. (Perhaps the reason that experimental neurocontrol papers have reported successes
while using 1HL nets is that they almost always dealt with feedback linearizable systems. In
the context of nonlinear systems, feedback linearizable ones constitute an extremely restricted
class, but they do admit continuous stabilizers, so the theoretical obstructions just discussed are
not relevant.)

2.5.2  Saturated  Linear  Systems 

The above limitations of 1HL nets in control notwithstanding, 1HL nets can be shown to be
perfectly suited to certain control problems. One line of work by the PIs deals with the use of
networks for the control of linear systems subject to actuator saturation. This is the situation
modeled by equations of the following type:

\[ \dot{x} = Ax + B\sigma(u) \]

where $A$ and $B$ are as usual in linear control theory and $\sigma$ is a saturation such as
$\tanh$ or $\pi$. It is often said that saturation is the most commonly encountered nonlinearity in
control engineering, so the development of techniques for the control of such systems is obviously
of great interest. Our work starts with the result by Fuller, around 1970, that it is in general
impossible to globally stabilize the origin of such systems by means of linear feedback $u = Fx$
(for a more general result along those lines, see [4]), even if the system is open-loop globally
controllable to the origin. This suggests the obvious question of searching for nonlinear feedback
laws $u = k(x)$ that achieve such stabilization, and in particular for nicely behaved and easily
implementable controllers (in contrast to optimal control techniques, which result in highly
irregular feedback). In 1990 we proved that smooth stabilization is always possible. Motivated by
our paper, Teel in 1992 showed that single-input multiple integrators can be stabilized by feedbacks
which are themselves compositions of linear functions and iterated saturations. We were able soon
after to extend Teel's result to arbitrary systems as above which are open-loop asymptotically
controllable (equivalently, the rank of the matrix $[\lambda I - A, B]$ is $n$ for all purely
imaginary $\lambda$, and $A$ has no eigenvalues with positive real part). More recently, we showed
that one can always use 1HL nets for implementing such feedback; see [18]. The more recent work [29]
extended these results to discrete-time systems.

As an example, the paper [43] developed an explicit design concerning a longitudinal flight model
of an F-8 aircraft, with saturations on the elevator rate, and tested the obtained controller on the
original nonlinear model. We chose the F-8 example since all parameters and typical trim conditions
are publicly available, and the model has often been used as a test case for aircraft control
designs. The procedure we followed consisted of first linearizing about an operating point and then
constructing a globally stabilizing controller for the resulting linearization, with respect to this
given trim condition, following the steps in our papers. Finally, we compared the performance of the
controller (applied to the original nonlinear airplane model and starting reasonably far from the
desired operating point) with the "naive" controller that would result from applying a linear
feedback law which would stabilize in the absence of saturations. The objective of the work was to
show that the calculations in the abstract proofs can indeed be carried out explicitly (though this
is an extremely simple case compared to the generality of the results in our papers) and, moreover,
to study the performance of the resulting controller when used for the original, nonlinear, model.
Although it turned out that our design provided unacceptable performance, the improvement with
respect to linear feedback was remarkable, and we consider the results to be encouraging and
indicating the usefulness of further work along this direction.
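The model $\dot{x} = Ax + B\sigma(u)$ can be explored with a few lines of simulation. The sketch below is not the design of [43] nor the construction from the papers cited above; it treats only the double integrator ($\dot{x}_1 = x_2$, $\dot{x}_2 = \sigma(u)$), a low-order case in which a saturated linear feedback $u = Fx$ is known to work, so the obstruction for general systems is not reproduced here. The gains in `F`, the step size and the use of $\tanh$ as the saturation are assumptions for illustration.

```python
import math

def simulate_saturated(F, x0, sat=math.tanh, dt=0.01, steps=6000):
    """Forward-Euler simulation of x1' = x2, x2' = sat(F[0]*x1 + F[1]*x2)."""
    x1, x2 = x0
    for _ in range(steps):
        u = F[0] * x1 + F[1] * x2          # linear feedback, fed through the saturation
        x1, x2 = x1 + dt * x2, x2 + dt * sat(u)
    return x1, x2

# Start far enough from the origin that the actuator is saturated initially.
x1_end, x2_end = simulate_saturated(F=(-1.0, -2.0), x0=(5.0, 0.0))
```

While $|u|$ is large the actuator delivers at most unit acceleration, so the initial approach is much slower than the unsaturated linear design would suggest; only near the origin does the loop behave like the linear one.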

2.6  Recurrent  Nets  as  Dynamical  Systems 

So far we have introduced in this overview several questions related to the areas of learning and
of control. The former area provides a methodology for the study of different aspects of both
feedforward and recurrent neural nets, and leads to the study of questions Q1-Q4 on approximation,
learning dimensions, error surfaces, and computational complexity. Feedback control is
an important application area, and leads to asking about the existence of networks implementing
regulation laws with particular properties. We now look at a third area, namely the study of
properties of recurrent networks as dynamical systems with inputs and outputs. This leads to
questions of observability, minimality, identification, and computational power. It turns out,
perhaps surprisingly given the universal approximation properties possessed by recurrent nets,
that such questions can be treated successfully to a large extent, in contrast to other classes
of nonlinear systems more classically considered.

Observability,  Identification 

Recall (see e.g. [5]) that two states of a system are distinguishable if it is possible to tell them
apart on the basis of input/output experiments. The PIs have written many papers since the late
1970s on such issues; references can be found in the ICM invited paper [30]. Observability means
that every pair of distinct states must be distinguishable. Unobservability constitutes a
fundamental limitation to the existence of an unbiased estimator of the internal states of a
network. And from a synthesis rather than analysis point of view, the study of observability is one
half of the study of redundancy: if there are indistinguishable states, one may in principle
eliminate the redundancy and build a smaller network which achieves exactly the same performance
objectives as specified in terms of input/output behavior. (The other half relates to the study of
reachability.)

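A formalization is as follows. An $n$-dimensional recurrent net $\Sigma = (A,B,C,\sigma)$ is
observable if for each pair of distinct states $\xi \in \mathbb{R}^n$ and $\zeta \in \mathbb{R}^n$ it
holds that

\[ \lambda_{\Sigma_\xi} \neq \lambda_{\Sigma_\zeta} \]

(where $\Sigma_\xi$ is the initialized recurrent net with initial state $\xi$). More precisely, one
should define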
discrete-time or continuous-time observability, since the definition of the input/output mapping
$\lambda$ depends on the interpretation of the equations as difference or differential equations
respectively. It turns out, interestingly, that there is a common algebraic characterization that
applies in both cases. Assume that $B$ satisfies this generic property: no row is identically zero
and

\[ \text{for each } i \neq j, \text{ there exists some } k \text{ such that } |b_{i,k}| \neq |b_{j,k}|, \qquad (20) \]

and that $(A,B,C,\sigma)$ is a homogeneous net whose activation is the standard sigmoid in
Equation (1). Let $O_c(A,C)$ be the largest $A$-invariant coordinate subspace included in
$\ker C$, where by a coordinate subspace we mean any subspace of $\mathbb{R}^n$ invariant under all
the coordinate projections $\pi_i : \mathbb{R}^n \to \mathbb{R}^n$, $\pi_i e_j = \delta_{ij} e_i$
($e_i$, $i = 1,\ldots,n$, denote the canonical basis elements in $\mathbb{R}^n$). It was shown in [15]
that $(A,B,C,\sigma)$ is observable if and only if $\ker A \cap \ker C = O_c(A,C) = 0$. This is
a very simple characterization, easy to check algorithmically. One obtains as a corollary, under
the above assumption on $B$, that the net is observable if the pair of matrices $(A,C)$ is observable
in the usual linear-algebraic sense (cf. [5]).
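The algebraic test just stated can be sketched in a few lines. This is an illustrative implementation, not code from [15], and it presumes the hypotheses stated above (homogeneous net, standard sigmoid, property (20) for $B$). It uses the fact that a coordinate subspace is spanned by a subset $S$ of the canonical basis vectors: such a span lies in $\ker C$ when the columns of $C$ indexed by $S$ vanish, and is $A$-invariant when the columns of $A$ indexed by $S$ have no nonzero entry in a row outside $S$. All function names are hypothetical.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix (list of rows) by exact Gaussian elimination."""
    M = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for col in range(len(M[0]) if M else 0):
        pivot = next((i for i in range(r, len(M)) if M[i][col] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, len(M)):
            factor = M[i][col] / M[r][col]
            M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

def largest_invariant_coordinate_subspace(A, C):
    """Index set S such that span{e_i : i in S} equals O_c(A, C)."""
    n = len(A)
    S = {i for i in range(n) if all(row[i] == 0 for row in C)}   # e_i lies in ker C
    changed = True
    while changed:
        changed = False
        for i in sorted(S):
            # A e_i is column i of A; it must have no component outside S.
            if any(A[j][i] != 0 for j in range(n) if j not in S):
                S.remove(i)
                changed = True
    return S

def satisfies_condition_20(B):
    """No zero row, and any two rows differ in absolute value in some column."""
    if any(all(b == 0 for b in row) for row in B):
        return False
    return all(any(abs(x) != abs(y) for x, y in zip(B[i], B[j]))
               for i in range(len(B)) for j in range(i + 1, len(B)))

def is_observable(A, B, C):
    """Check ker A ∩ ker C = 0 and O_c(A, C) = 0, given that B satisfies (20)."""
    if not satisfies_condition_20(B):
        raise ValueError("B does not satisfy the generic property (20)")
    kernels_trivial = rank(A + C) == len(A)      # stacked matrix [A; C] has full column rank
    return kernels_trivial and not largest_invariant_coordinate_subspace(A, C)

B = [[1], [2]]
C = [[1, 0]]
coupled = is_observable([[1, 1], [0, 1]], B, C)      # second neuron feeds the observed one
decoupled = is_observable([[1, 0], [0, 2]], B, C)    # second neuron never reaches the output
```

In the second example the span of $e_2$ is an $A$-invariant coordinate subspace inside $\ker C$, so the test fails even though $\ker A \cap \ker C = 0$.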

We have also studied in the past, and intend to continue, work on parameter identifiability,
that is to say, the possibility of recovering the entries of the matrices $A$, $B$, and $C$, and,
for initialized systems, also the initial state, from the input/output behavior of the net. Assume
that we restrict attention to homogeneous nets that satisfy the following conditions. The activation
$\sigma$ is infinitely differentiable around zero, and satisfies the following mild nonlinearity
assumption: $\sigma'(0) \neq 0$ and $\sigma^{(q)}(0) \neq 0$ for some $q > 2$. (So for analytic
functions, we are just asking that $\sigma$ be nonlinear and nonsingular at zero.) Furthermore, we
assume that $B$ satisfies (20) and no entry of $B$ vanishes, and that the triple of matrices
$(A,B,C)$, $A \in \mathbb{R}^{n\times n}$, $B \in \mathbb{R}^{n\times m}$,
$C \in \mathbb{R}^{p\times n}$, is canonical. (That is, observable and controllable, as in [5],
section 5.5; this is a generic set of triples, in the sense that the entries of those which do not
satisfy the property form a proper Zariski-closed subset of $\mathbb{R}^{n^2+nm+np}$.) Then, in [14],
a general result was proved, which in particular implies that if
$\Sigma = (A,B,C,0,\vec{\sigma}^{(n)})$ and $\Sigma' = (A',B',C',0,\vec{\sigma}^{(n')})$ have the same
input/output behaviors $\lambda$, then $n = n'$ and the two triples of matrices are sign-permutation
equivalent, meaning that there exists a nonsingular matrix $T$ such that $T^{-1}AT = A'$,
$T^{-1}B = B'$, $CT = C'$, and $T$ is a permutation matrix. That is, the two networks are exactly the
same except for a relabeling of neurons.
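The easy direction of this statement, that relabeling neurons leaves the input/output behavior unchanged, can be checked directly. The sketch below assumes a discrete-time net $x(t+1) = \sigma(Ax(t) + Bu(t))$, $y(t) = Cx(t)$ with zero initial state, scalar inputs and outputs, and $\tanh$ activation; the function names are hypothetical. It does not illustrate the hard direction proved in [14], namely that equal behavior forces such an equivalence.

```python
import math

def run_net(A, B, C, inputs, sigma=math.tanh):
    """Outputs y(t) = C x(t) of x(t+1) = sigma(A x(t) + B u(t)), x(0) = 0."""
    n = len(A)
    x = [0.0] * n
    ys = []
    for u in inputs:
        x = [sigma(sum(A[i][j] * x[j] for j in range(n)) + B[i] * u) for i in range(n)]
        ys.append(sum(C[j] * x[j] for j in range(n)))
    return ys

def relabel(A, B, C, perm):
    """Return (T^-1 A T, T^-1 B, C T) for the permutation matrix T with T e_i = e_perm[i]."""
    n = len(A)
    A2 = [[A[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
    B2 = [B[perm[i]] for i in range(n)]
    C2 = [C[perm[j]] for j in range(n)]
    return A2, B2, C2

A = [[0.2, -0.5], [0.7, 0.1]]
B = [1.0, 2.0]
C = [1.0, -1.0]
A2, B2, C2 = relabel(A, B, C, perm=[1, 0])       # swap the two neurons
inputs = [0.3, -1.0, 0.5, 0.0]
same = all(abs(a - b) < 1e-12
           for a, b in zip(run_net(A, B, C, inputs), run_net(A2, B2, C2, inputs)))
```

The relabeling commutes with the activation because $\sigma$ is applied coordinatewise; a general invertible $T$ would not, which is the intuition behind the much smaller equivalence group than in the linear case.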

This is a rather surprising result, which says that "function (input/output behavior)
determines form (internal structure)". We recall that for linear systems the analogous result is
far weaker, and only allows concluding equivalence modulo $GL(n,\mathbb{R})$ as opposed to modulo a
discrete group (cf. [5]). The proof starts with the obvious idea of considering the parameters
as extra constant states and computing the observables for the extended systems; this involves
considering iterated Lie derivatives of observations under the vector fields defining the system and
is a routine approach in nonlinear control. However, and this is the nontrivial part, extracting
enough information from these Lie derivatives is not entirely easy, and the proof centers on that
issue. Moreover, if the activation is analytic, the response to a single long-enough input function
is theoretically enough for identification of all parameters, in analogy to the use of
impulse responses in linear systems theory. More recent work showed that, under additional
assumptions on the activation (conditions which are satisfied when $\sigma$ is the standard sigmoid,
for example), one can obtain analogous results even for nonzero and distinct initial states (and one
concludes that these states are also in the same orbit under the group action described above); see
[41] for preliminary work (a full paper is still under preparation, as we wish to obtain as general
a result as possible).

A related area is that of what we call Fourier-recurrent neural networks. These are networks
with activations from the set $\{\sin x, \cos x\}$. (More precisely, we prefer to equivalently use
complex state variables and the activation $e^{ix}$, but this is just a question of mathematical
convenience.) In the paper [23], we presented a closed-form procedure, not involving any nonlinear
optimization, for the identification of the entries of $A$, $B$, $C$ and the initial state for such
nets. The procedure is based on Hankel-matrix techniques that are classical in the context of linear
recurrences and their multivariable extensions developed in control theory (see a detailed
discussion in Chapter 5 in [5]), as well as in many other areas such as coding theory, for decoding
BCH codes, and learning theory, for sparse polynomial interpolation.
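The classical Hankel-matrix idea for linear recurrences can be shown on a scalar sequence. The sketch below is only the textbook ingredient, not the identification procedure of [23]: for a sequence satisfying a linear recurrence, the rank of a sufficiently large Hankel matrix built from its terms equals the order of the shortest such recurrence. The function names and the example sequences are chosen for illustration.

```python
from fractions import Fraction

def hankel(seq, size):
    """size x size Hankel matrix with entries H[i][j] = seq[i + j]."""
    return [[Fraction(seq[i + j]) for j in range(size)] for i in range(size)]

def rank(rows):
    """Rank of a matrix (list of rows) by exact Gaussian elimination."""
    M = [list(row) for row in rows]
    r = 0
    for col in range(len(M[0]) if M else 0):
        pivot = next((i for i in range(r, len(M)) if M[i][col] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, len(M)):
            factor = M[i][col] / M[r][col]
            M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

fibonacci = [1, 1, 2, 3, 5, 8, 13]     # a(t+2) = a(t+1) + a(t): order 2
geometric = [1, 2, 4, 8, 16]           # a(t+1) = 2 a(t): order 1
order_fib = rank(hankel(fibonacci, 3))
order_geo = rank(hankel(geometric, 3))
```

Exact rational arithmetic is used so that the rank is not affected by rounding; with noisy data one would instead look at singular values, which is outside the scope of this sketch.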

Computational  Abilities;  Computational  Complexity 

The last general subject that we wish to mention in this report deals with questions that are
somewhat more abstract and are framed in the language of theoretical computer science. The
topic is the exploration of the ultimate capabilities of recurrent nets viewed as analog computing
devices. This area is a fascinating one, but very difficult to approach. Part of the problem is
that, much interesting work notwithstanding, analog computation is hard to model, as difficult
questions about precision of data and readout of results are immediately encountered (see the
references in our papers cited below).

In a series of papers by one of the PIs (see [17], as well as the April 28, 1995 issue of Science,
and the "handbook" article [34]), we took the point of view that artificial neural nets provide an
opportunity to reexamine some of the foundations of analog computation from the new perspective
afforded by an extremely simple yet surprisingly rich model, in a context where techniques
from dynamical systems theory interact naturally with more standard notions from theoretical
computer science. For recurrent nets with activation the piecewise linear function $\pi$ in
Equation 2, we derived results on deterministic versus nondeterministic computation, and related the
study to standard concepts in complexity theory. Perhaps the previous work closest in spirit is that
on real-number-based computation started by Blum, Shub, and Smale ("BSS" model). In contrast
to that line of work, however, recurrent nets do not incorporate discontinuous (and hence highly
nonrobust), infinite precision "if-then-else" decisions, nor can discrete results be read out of the
system through infinite precision measurements. Thus recurrent nets are more restricted than
the BSS model.

One of the most unexpected conclusions was that, at least within the formalism of analog
computation proposed there, recurrent neural nets (with activation $\pi$) are a universal model, in
much the same manner as Turing machines are for classical digital computing. It was shown
that no possible "analog computer" (in a sense precisely defined) could ever have more power
(again in a precisely defined sense) up to polynomial time speedups. Particularly satisfying
theoretically is the fact that the most natural categories for the values of weights correlate
perfectly with different natural subclasses of computing devices, and can be summarized in an
extremely elegant fashion: integer weights correspond to finite automata, rational to Turing
machines, and real to arbitrary "P/poly" computation (the well-studied Karp-Lipton class consisting
of nonuniform polynomial-size circuits or equivalently sparse-oracle Turing Machines).
(Furthermore, we showed in [38] how to characterize computations by networks having weights in
certain Kolmogorov-complexity definable classes in between rationals and reals.)

Another unanticipated (and intriguing) conclusion was that the class NP of nondeterministic
polynomial-time digital computation is not included in what can be computed in polynomial
time with analog devices (this is proved under standard assumptions of the “$P \neq NP$” type). Thus
the solution of combinatorial problems using analog devices may be subject to the same ultimate
computational obstructions, for large problem sizes, as with digital computing.

The work on computation capabilities turns out to be relevant to the study of “hybrid” control
systems which combine automata and linear systems, or saturation devices in feedback loops. For
example, the problem of determining if a dynamical system $x(t+1) = \sigma^{(n)}(Ax(t))$ ever reaches
an equilibrium point, from a given initial state, is shown to be effectively undecidable (at least
for $\sigma = \pi$) as a consequence of the established universality. (In Hopfield-type nets, when dealing
with content-addressable retrieval, the initial state is taken as the “input pattern” and the final
state, if there is convergence in finite time, as a class representative.) The implications of this
result for various questions in control theory, and in particular hybrid systems, are discussed in
some detail in the papers [51] and [36].
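
Undecidability here means that no procedure settles the question for every matrix and initial state; a simulation can only search up to a step budget. The following is a minimal sketch of such a bounded search, not a construction taken from the cited papers. It assumes that $\pi$ is the saturation onto the interval [0, 1] (Equation 2 is not reproduced in this section, so this form is an assumption), and the names `sat`, `step`, and `steps_to_equilibrium` are hypothetical. Exact rational arithmetic is used so that testing equality of states is meaningful.

```python
from fractions import Fraction

def sat(x):
    """Saturated-linear activation: clip x to [0, 1] (assumed form of pi)."""
    return min(max(x, 0), 1)

def step(A, x):
    """One update x(t+1) = sat(A x(t)), with sat applied coordinatewise."""
    return [sat(sum(a * xj for a, xj in zip(row, x))) for row in A]

def steps_to_equilibrium(A, x0, max_steps):
    """Return the first t with x(t+1) == x(t), or None if no such t
    occurs within max_steps. None proves nothing about later times."""
    x = list(x0)
    for t in range(max_steps + 1):
        nxt = step(A, x)
        if nxt == x:
            return t
        x = nxt
    return None

half = Fraction(1, 2)
A_contract = [[half, 0], [0, half]]   # state halves forever, never exactly 0
A_saturate = [[2, 0], [0, 2]]         # state doubles until it saturates at 1
x0 = [Fraction(1, 8), Fraction(1, 4)]
```

A return value of `None` only says that no equilibrium was reached within the budget; by the undecidability result, no fixed budget can work for all systems.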

3 Some More Details

In this part of the report, we have chosen to provide some more details on a few selected
subtopics, especially those represented in very recent work and hence possibly less easily
available. Once more we remind the reader that the Web page found at the URL
http://www.math.rutgers.edu/sontag/ provides access to a substantial number of papers
by the PIs, to be consulted for details and a mathematically rigorous presentation.

3.1  Error  Functions 

Here we deal with question Q4: “what are the properties of the error function being minimized
in (17)?”, for hypothesis classes of the form $\mathcal{F} = \mathcal{F}_A$. This is motivated in our development
by the risk minimization solution of the learning problem, but it is of course what is done
experimentally, not necessarily with any theoretical justification, in standard neural net practice:
given a parametric form, find parameters so that the error on the training data is minimized.
Numerical algorithms based on gradient descent (so-called “backpropagation”) are usually
employed for the minimization. Thus it is necessary to study the points at which the derivative
of this error function may become zero, and more particularly the location of possible local but
nonglobal minima. Among PI work in this direction are the references [7], [9], [50], [22]. We now
summarize some facts from the last of these papers. The emphasis of this past work was on 1HL
nets, which, as discussed in Section 2.1.1, are those most commonly found in applications.

By a 1HL architecture we mean a homogeneous, scalar-output, single hidden layer architecture
with fully connected layers, that is, the two matrices in the specification of $A$ that describe
the connections between layers both have all entries equal to one. If $A$ has $n$ neurons and
activation $\sigma$, then the nets with architecture $A$ are precisely the possible 1HL nets with $n$
neurons and activation $\sigma$. These compute functions as in Equation (7), can be represented
diagrammatically as in Figure 3, and the parameter space has dimension $r = n(m+2) + 1$, where
$p \in \mathbb{R}^r$ represents the scalars $c_0, \ldots, c_n$ and $a_1, \ldots, a_n$, and the $m$-row vectors
$A_1, \ldots, A_n$. For purposes of this discussion, we will assume that $\sigma(x) = \tanh(x)$ (or, up to
rescalings, the standard sigmoid), which is the activation most commonly used in experimental
work, and we assume that $n$ has been fixed. (The study of good choices for $n$ leads to
“structural risk minimization” studies, about which we do not have anything to contribute at
this time.)

There are portions of the parameter space that give rise to degeneracies. For instance, if one
coefficient $c_i$ ($i \neq 0$) vanishes, then the input/output function $f$ is independent of the values of
the corresponding $A_i$ and $a_i$. If some $A_i = 0$ then the corresponding term is constant and can be
absorbed into $c_0$. If for two different indices $i \neq j$ it is the case that $A_i = A_j$ and $a_i = a_j$, then
the terms corresponding to $i$ and $j$ can be combined, and only the sum $c_i + c_j$ is relevant, resulting
also in a loss of dimensionality. Similarly, since tanh is an odd function, if $(A_i, a_i) = -(A_j, a_j)$
then terms can be combined as well. Thus a more natural parameter space is the subset consisting of
all the $a_i$'s, $c_i$'s, and $A_i$'s for which these exceptional situations do not occur. We will restrict
attention to parameters in this subset.
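
The four degeneracies can be checked numerically. The sketch below assumes that the functions of Equation (7), which is not reproduced in this section, have the form $f(u) = c_0 + \sum_{i=1}^{n} c_i \tanh(A_i u + a_i)$; this form is consistent with the parameter count $r = n(m+2)+1$ but is an assumption. The helper names `net` and `same_function` and all numerical values are hypothetical.

```python
import math

def net(c0, terms, u):
    """f(u) = c0 + sum_i c_i * tanh(A_i . u + a_i); terms = [(c_i, A_i, a_i), ...]."""
    return c0 + sum(c * math.tanh(sum(w * x for w, x in zip(A, u)) + a)
                    for c, A, a in terms)

def same_function(p, q, inputs, tol=1e-12):
    """Compare two parameter settings (c0, terms) on a list of test inputs."""
    return all(abs(net(*p, u) - net(*q, u)) <= tol for u in inputs)

inputs = [(-1.0, 2.0), (0.3, -0.7), (1.5, 0.25)]
base = (0.5, [(2.0, (1.0, -1.0), 0.2), (-1.0, (0.5, 0.5), -0.3)])

# c_i = 0: the values of A_i and a_i are irrelevant.
zero_coeff = (0.5, base[1] + [(0.0, (9.0, 9.0), 9.0)])
# A_i = 0: the term is the constant c_i * tanh(a_i) and can be absorbed into c0.
const_term = (0.5 - 3.0 * math.tanh(0.4), base[1] + [(3.0, (0.0, 0.0), 0.4)])
# Equal (A_i, a_i): only the sum of the two coefficients matters.
split = (0.5, [(1.5, (1.0, -1.0), 0.2), (0.5, (1.0, -1.0), 0.2), base[1][1]])
# tanh is odd: (c, A, a) and (-c, -A, -a) give the same term.
flipped = (0.5, [(-2.0, (-1.0, 1.0), -0.2), base[1][1]])
```

Each variant uses different parameters (and in three cases a different number of terms) yet agrees with `base` on every test input, which is why these parameter values are excluded from the parameter set.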

Assume given a training or regression data set (“labeled sample”) $\nu = ((u_1, y_1), \ldots, (u_s, y_s))$,
where we interpret the $u_i$'s as input vectors (“regressors” in statistical terms) and the scalars
$y_i$'s as targets or responses desired for the respective $u_i$'s. The (regression) problem is
that of minimizing (typically by means of steepest descent or other local search techniques) the
quadratic loss, error, or “risk” $\mathrm{Emp}(f, \nu)$ over all $f = f(p, \cdot) \in \mathcal{F}_A$, that is to say, over all $p \in \mathbb{R}^r$.
We write $E^\nu(p)$ instead of $\mathrm{Emp}(f(p, \cdot), \nu)$ in order to emphasize the dependence of this quantity
on the parameters, once the labeled sample has been fixed. Thus, the problem is to study
the function

$$E^\nu(p) = \sum_{i=1}^{s} \bigl( f(p, u_i) - y_i \bigr)^2$$

with domain the above parameter set.
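
As an illustration of the minimization problem, here is a small sketch of the quadratic error of a 1HL tanh net and of a plain gradient descent on it. It is not the procedure analyzed in [22]. It assumes scalar inputs ($m = 1$), the functional form $f(p, u) = c_0 + \sum_i c_i \tanh(A_i u + a_i)$, an unnormalized sum of squares, a finite-difference gradient, and a step-halving rule; the parameter ordering and all names and numerical values are hypothetical.

```python
import math

def f(p, u, n):
    """1HL tanh net with scalar input: p = [c0, c_1..c_n, a_1..a_n, A_1..A_n]."""
    c0, c, a, A = p[0], p[1:n + 1], p[n + 1:2 * n + 1], p[2 * n + 1:]
    return c0 + sum(c[i] * math.tanh(A[i] * u + a[i]) for i in range(n))

def error(p, sample, n):
    """Quadratic loss: sum over the sample of (f(p, u_i) - y_i)^2."""
    return sum((f(p, u, n) - y) ** 2 for u, y in sample)

def gradient(p, sample, n, h=1e-6):
    """Central finite-difference approximation of the gradient of the loss."""
    g = []
    for j in range(len(p)):
        up = p[:j] + [p[j] + h] + p[j + 1:]
        dn = p[:j] + [p[j] - h] + p[j + 1:]
        g.append((error(up, sample, n) - error(dn, sample, n)) / (2 * h))
    return g

def descend(p, sample, n, iters=200, step=0.1):
    """Gradient descent with step halving, so the loss never increases."""
    p = list(p)
    for _ in range(iters):
        g = gradient(p, sample, n)
        e0, t = error(p, sample, n), step
        while t > 1e-12:
            q = [pj - t * gj for pj, gj in zip(p, g)]
            if error(q, sample, n) < e0:
                p = q
                break
            t /= 2
    return p

n = 2
p_true = [0.1, 1.0, -0.5, 0.3, -0.2, 1.5, 0.7]   # r = n(m + 2) + 1 = 7 for m = 1
us = [-2.0 + 0.25 * i for i in range(17)]        # s = 17 > 2n(m + 2) + 3 = 15
sample = [(u, f(p_true, u, n)) for u in us]
p_start = [0.0, 0.5, -0.2, 0.0, 0.1, 1.0, 0.5]
p_end = descend(p_start, sample, n)
```

The sketch only guarantees that the error does not increase along the iteration; whether it reaches the global minimum (error zero at `p_true`) depends on the critical points discussed next.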

Critical  Point  Analysis 

It has often been remarked that, even for extremely simple cases (such as $n = 1$ and supposing that
the inputs are binary vectors), there arise critical points associated to non-global local minima,
and thus the study of the set of critical points of $E^\nu$ has been frequently put forward as a research
topic. In this context it has also been observed several times that, as with other least-squares
problems, pathological behavior will depend heavily on the training sets not being in “general
position” in appropriate senses of probability or topological density. In the paper [22] we obtained
several characterizations of the critical set.

One of the main results given in that paper was that the set of critical points is finite, with
an explicit upper bound on its cardinality (assuming that there are enough samples to make the
problem not underdetermined, specifically that $s > 2n(m+2) + 3$, and for generic regression data).
If the number of samples scales linearly with the number of nodes $n$, and assuming a constant input
dimension, an upper bound of the type

$$2^{cn^4}$$

results. A lower bound also holds, due to symmetries in the problem: any
exchange among the $n$ terms in the sum preserves $f$.

We review next the organization of the proof, to illustrate the types of techniques used. We
first showed that analytically parameterized classes of functions can be identified generically on
the basis of just $2r + 1$ samples, if $r$ is the number of free parameters. This part of the paper
depended on basic facts about real-analytic functions, all consequences of the basic stratification
theorems for subanalytic sets due to Hironaka, Lojasiewicz, and others (see [1] for an introduction,
with complete proofs, to the needed results on stratification theory). We then studied critical
points for least-squares error criteria; this step relied upon elementary differential topology,
basically the elementary Morse theory found in most textbooks, but applied only after yet another
application of analytic set theory. We then showed, for generic analytic problems, the countability
of the set of functions giving rise to critical parameter values, and a refinement showing that this
set is in fact finite, provided that the parametric class of functions be definable logically in terms
of the exponential and certain other special analytic functions; this was based on the recent work
in logic, dealing with o-minimal logical theories, due to Wilkie, MacIntyre, van den Dries, Miller,
Gabrielov, and others. Finally, we specialized to single hidden layer networks, where one can use
results on determination of parameters (as given in [31]) in order to obtain finiteness of the set
of critical parameters. Once that finiteness is known, one may use Khovanskii estimates on Betti
numbers for sets defined by exponential/algebraic expressions in order to bound the number of
connected components, since the set in question is now known to be an embedded submanifold,
and this bound is of course just the number of points, giving the estimate mentioned above.

Convergence to True Parameters

In the above context of minimization of a loss criterion, one needs to understand the convergence
of gradient descent procedures (and of related “on line” methods, which only approximate
the gradient by considering one term at a time). In this context, we mention that in general
the existence of nonglobal local minima means of course that no global convergence may be
expected. For a special case treated in [9], it is possible to show that gradient descent converges
globally (we omit the details here), but in the general case the existence of such obstructions is
unavoidable. Local convergence will hold if the number of inputs is large enough (the bound given
above is sufficient), and in that case one may ask the following question: if the training data was
indeed generated by a 1HL net (seen as a “black box”), does gradient descent lead to the true
parameters? This issue, for the generic parameter space mentioned above, was posed (in slightly
different language, but the same mathematical problem) in the late 1980s by Hecht-Nielsen. A
positive answer follows from the results presented in [2], for the special case of the activation
tanh. The more recent work [31], mentioned above, extended this to a wide class of real-analytic
functions, using complex-variable techniques rather than the asymptotic analysis of [2].

3.2  Estimation  of  Learning  Dimension 

Here we deal with question Q2: “does $\mathcal{F}$ have finite pseudo-dimension?”, for hypothesis classes
of the form $\mathcal{F} = \mathcal{F}_A$. Recall the PAC learning formulation discussed in Section 2.4.1: a set of
labeled training samples is provided, and a network must be obtained which is then expected
to (“probably and approximately”) correctly classify previously unseen inputs. In this context,
a central problem is to estimate the amount of training data needed to guarantee satisfactory
learning performance, and this in turn leads to the search for pseudo-dimension estimates.

We first define the special case of VC dimension, and then pseudo- (or Pollard) dimension.
The concept of VC dimension is classically defined in terms of abstract concept classes; we review
that first and then interpret it in terms of functions. As in Section 2.4, we are given an input set
$U$. Assume also given a family $\mathcal{C}$ of subsets of $U$, called the set of “concepts.” A subset $U_0 \subseteq U$
is said to be shattered (by the class $\mathcal{C}$) if for each subset $B \subseteq U_0$ there is some $C \in \mathcal{C}$ such that
$B = C \cap U_0$. The VC dimension is then the largest possible positive integer $k$ (possibly $+\infty$)
so that there is some $U_0 \subseteq U$ of cardinality $k$ which can be shattered. An equivalent manner
of stating these notions, somewhat more suitable for our purposes, proceeds by identifying the
subsets of $U_0$ with Boolean functions from $U_0$ to $\{0, 1\}$: to each such Boolean function $\phi$ there
is an associated subset, namely $\{x \in U_0 \mid \phi(x) = 1\}$, and conversely, to each set $B \subseteq U_0$ one
can associate its characteristic function $\phi_B$ defined on the set $U_0$. Similarly, we can think of the
sets $C \in \mathcal{C}$ as Boolean functions on $U$ and of the intersections $C \cap U_0$ as the restrictions of such
functions to $U_0$. Thus we restate the definitions now in terms of functions.

Given the set $U$, and a subset $U_0$ of $U$, a dichotomy on $U_0$ is a function $\delta : U_0 \to \{0, 1\}$.
Assume given a class $\mathcal{F}$ of functions $U \to \{0, 1\}$, to be called the class of classifier functions. The
subset $U_0 \subseteq U$ is shattered by $\mathcal{F}$ if each dichotomy on $U_0$ is the restriction to $U_0$ of some $f \in \mathcal{F}$.
The Vapnik-Chervonenkis (VC) dimension $\mathrm{vc}(\mathcal{F})$ is the supremum (possibly infinite) of the set
of integers $k$ for which there is some subset $U_0 \subseteq U$ of cardinality $k$ which can be shattered by
$\mathcal{F}$.
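
For a finite input set and a finite class of classifiers, these definitions can be checked by exhaustive search. The sketch below is only meant to make the definitions concrete; the two example classes (one-sided thresholds and intervals on six integer points) and the names `is_shattered` and `vc_dimension` are illustrative assumptions, not objects from the report. The search stops at the first cardinality with no shattered subset, which is justified because every subset of a shattered set is itself shattered.

```python
from itertools import combinations, product

def is_shattered(points, classifiers):
    """True if every dichotomy on `points` is the restriction of some classifier."""
    realized = {tuple(f(x) for x in points) for f in classifiers}
    return all(d in realized for d in product((0, 1), repeat=len(points)))

def vc_dimension(universe, classifiers):
    """Largest k such that some k-element subset of a finite universe is shattered."""
    best = 0
    for k in range(1, len(universe) + 1):
        if any(is_shattered(s, classifiers) for s in combinations(universe, k)):
            best = k
        else:
            break
    return best

universe = tuple(range(6))
# Thresholds: x is accepted when x >= t.
thresholds = [lambda x, t=t: int(x >= t) for t in range(7)]
# Intervals: x is accepted when a <= x <= b (b < a gives the empty concept).
intervals = [lambda x, a=a, b=b: int(a <= x <= b)
             for a in range(6) for b in range(-1, 6)]
```

Thresholds cannot realize the dichotomy that accepts a smaller point and rejects a larger one, and intervals cannot accept two outer points while rejecting one between them, which is what limits the two dimensions.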

By abuse of terminology, when $\mathcal{F}$ is a class of real-valued functions (as with maps $\mathcal{F}_A$ induced
by fixed neural architectures), by $\mathrm{vc}(\mathcal{F})$ one means $\mathrm{vc}(\mathcal{H}(\mathcal{F}))$, where $\mathcal{H}(\mathcal{F}) = \{H \circ f,\ f \in \mathcal{F}\}$,
and $H$ is the Heaviside function introduced earlier. Thus $\mathrm{vc}(\mathcal{F})$ quantifies the amount of training
data needed for reliable prediction when using functions in $\mathcal{F}$ as classifiers: a positive output
means “accepted” and a negative output “rejected”. This is consistent with the standard
use of neural networks as classification machines. For binary-valued hypothesis classes, the VC
dimension provides tight estimates (as opposed to merely upper bounds) of the sample complexity
discussed in Section 2.4.1, assuming we are interested in learning with respect to the class of
all possible (Borel) probability measures: under certain simple nontriviality assumptions and for
$0 < \varepsilon < 1/2$, the required number of samples, written here as $s(\varepsilon, \delta)$, satisfies

$$s(\varepsilon, \delta) \ge \max\left\{ \frac{1-\varepsilon}{\varepsilon} \ln\frac{1}{\delta},\ \mathrm{vc}(\mathcal{F})\,\bigl(1 - 2(\varepsilon(1-\delta) + \delta)\bigr) \right\}.$$

Together with the upper bounds reviewed earlier, this says roughly that sample complexity is
proportional to $\mathrm{vc}(\mathcal{F})$.
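
To see how the two terms of this lower bound compare, one can evaluate it numerically. The sketch below reads the bound as the maximum of $\frac{1-\varepsilon}{\varepsilon}\ln\frac{1}{\delta}$ and $\mathrm{vc}(\mathcal{F})\,(1 - 2(\varepsilon(1-\delta)+\delta))$ for $0 < \varepsilon < 1/2$; this reading of the formula, the function name `sample_lower_bound`, and the numerical values in the usage note are assumptions made for illustration.

```python
import math

def sample_lower_bound(vc, eps, delta):
    """max{ ((1 - eps)/eps) * ln(1/delta), vc * (1 - 2*(eps*(1 - delta) + delta)) }."""
    if not (0 < eps < 0.5 and 0 < delta < 1):
        raise ValueError("expected 0 < eps < 1/2 and 0 < delta < 1")
    confidence_term = (1 - eps) / eps * math.log(1 / delta)
    dimension_term = vc * (1 - 2 * (eps * (1 - delta) + delta))
    return max(confidence_term, dimension_term)
```

For instance, with accuracy 0.1 and confidence parameter 0.05 the second factor is 1 - 2(0.095 + 0.05) = 0.71, so for large VC dimension the bound grows linearly in $\mathrm{vc}(\mathcal{F})$, while for small VC dimension the first term dominates.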

One may also define a measure useful for learning real-valued functions, as discussed in
Section 2.4.1. The notion of pseudo-dimension is used in the framework developed by Haussler
and based on previous work by Vapnik, Chervonenkis, and Pollard. Given a class $\mathcal{F}$ of functions
from $U$ to $\mathbb{R}$, we may introduce, for each $f \in \mathcal{F}$, the function

$$g_f : (x, y) \mapsto \mathrm{sign}(f(x) - y), \qquad (x, y) \in U \times \mathbb{R},$$

as well as the class $\mathcal{F}_0$ consisting of all such $g_f$. The pseudo-dimension of $\mathcal{F}$ is the extension of
VC dimension to non-binary function classes given by the formula $\mathrm{pd}(\mathcal{F}) := \mathrm{vc}(\mathcal{F}_0)$. We will
focus here on VC dimension questions, that is, binary classification, but many of our results (the
upper bounds by analogous proofs, and the lower bounds as corollaries) apply also to
pseudo-dimension estimates. (One should observe that, in contrast to the binary case, finiteness of
pseudo-dimension is merely a sufficient, not a necessary, condition for learnability. On the other
hand, the actual proofs provide also lower bounds for so-called scale-sensitive dimensions, which
for the continuous output case do provide necessary conditions. Another important observation
is that boundedness of the output space is a requirement in the connection between
pseudo-dimension and PAC learnability, which means that results must either assume a “squashing” of
the output, or consider bounded loss functions such as $|y_1 - y_2|^2 / (1 + |y_1 - y_2|^2)$.)

3.2.1  VC  Dimension  for  Feedforward  Nets 

The well-known work of Cover in 1968 and Baum and Haussler in 1989 dealt with the computation
of $\mathrm{vc}(\mathcal{F})$ when the class $\mathcal{F}$ consists of networks built up from hard-threshold activations
and having $r$ weights; they showed that $\mathrm{vc}(\mathcal{F}) = O(r \log r)$. Conversely, Maass showed in 1993
that there is also a lower bound of this form. It would appear that this definitely settled the
VC dimension (and hence also the sample size) question. However, that estimate assumed an
architecture based on hard-threshold (Heaviside) activations. In contrast, the usually employed
gradient descent learning algorithms (“backpropagation” method) rely upon continuous activations,
that is, neurons with graded responses. As pointed out in [8], the use of analog activations,
which allow the passing of rich (not just binary) information among levels, may result in higher
memory capacity as compared with threshold nets. This has serious potential implications in
learning, essentially because more memory capacity means that a given function $f$ may be able
to “memorize” in a “rote” fashion too much data, and less generalization is therefore possible.
Indeed, the paper [11] showed that there are conceivable (though not very practical) neural
architectures with extremely high VC dimensions. Thus the problem of studying $\mathrm{vc}(\mathcal{F})$ for analog
networks is an interesting and relevant issue. Two important contributions in this direction were
the papers by Maass in 1993 and by Goldberg and Jerrum in 1993, which showed upper bounds
on the VC dimension of networks that use piecewise polynomial activations. The paper [39]
introduced techniques from model theory and analytic function theory to show that the VC dimension
is finite for large classes of activations (including the standard sigmoid, in particular), and
following along this direction a paper by MacIntyre and Karpinski recently succeeded in establishing
an explicit upper bound for the standard sigmoid as well as other Pfaffian activations.

Goldberg and Jerrum established for piecewise polynomial activations an upper bound of
$O(r^2)$, where, as before, $r$ is the number of weights. However, it was an open problem in a 1994
survey by Maass if there is a matching lower bound for such networks, and more generally for
arbitrary continuous-activation nets. It could have been the case that the upper bound $O(r^2)$ is
merely an artifact of the method of proof, and that reliable learning with continuous-activation
networks is still possible with far smaller sample sizes, proportional to $r \log r$.

But this is not the case, and in the papers [25], [53] we answered Maass' open question in the
affirmative. As in [11], we say that the activation $\sigma : \mathbb{R} \to \mathbb{R}$ is sigmoidal, or a sigmoid, if: (1) $\sigma$ is
differentiable at some point $x_0$ where $\sigma'(x_0) \neq 0$, and (2) $\lim_{x \to -\infty} \sigma(x) = 0$ and $\lim_{x \to +\infty} \sigma(x) = 1$
(the limits 0 and 1 can be replaced by any distinct numbers). (Note that the first condition rules
out the Heaviside activation.) Then there are architectures with arbitrarily large numbers of
weights $r$ and VC dimension proportional to $r^2$. The proof relies on first showing that networks
consisting of two types of activations, Heavisides and linear, already have this power. This is a
somewhat surprising result, since purely linear networks result in VC dimension proportional to
$r$, and purely threshold nets have, as per the results quoted above, VC dimension bounded by
$r \log r$. The desired result on continuous activations is then obtained by approximating Heaviside
gates by $\sigma$-nets with large weights and approximating linear gates by $\sigma$-nets with small weights
(sigmoids are “locally linear and globally thresholds”). A number of variations, dealing with
Boolean inputs, or weakening the assumptions on $\sigma$, are also discussed in [25], whose last section
also describes an interpretation of the results in terms of threshold-only networks with “shared”
weights. Our result applies, as a very special case, to the standard sigmoid $1/(1 + e^{-x})$.
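
The “locally linear and globally threshold” property can be illustrated for the standard sigmoid. In the sketch below, a large input weight makes the gate behave like a Heaviside gate away from zero, and a small input weight followed by an affine rescaling of the output makes it behave like a linear gate. The rescaling constants use $x_0 = 0$, where the standard sigmoid has value 1/2 and derivative 1/4; the function names and the particular gains are hypothetical, and this is an illustration of the idea rather than the construction of [25].

```python
import math

def sigmoid(x):
    """Standard sigmoid 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def approx_heaviside(x, gain):
    """Large input weight: sigmoid(gain * x) approaches H(x) for x != 0."""
    return sigmoid(gain * x)

def approx_identity(x, eps):
    """Small input weight, rescaled output:
    (sigmoid(eps * x) - sigmoid(0)) / (eps * sigmoid'(0)),
    with sigmoid(0) = 1/2 and sigmoid'(0) = 1/4."""
    return (sigmoid(eps * x) - 0.5) / (eps * 0.25)
```

For example, `approx_heaviside(0.1, 200)` is within about 2e-9 of 1, and `approx_identity(3.0, 1e-3)` differs from 3 by roughly 2e-6; both errors shrink as the gain grows or `eps` decreases, for inputs in a bounded range.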

3.2.2  VC  Dimension  for  Recurrent  Nets 

We explained in Section 2.4.3 the interest in hypothesis classes whose inputs $u \in U$ are themselves
finite sequences, and specifically we wish to look at classes $\mathcal{F}_{A,k}$ (see Equation (15)) consisting
of those mappings $(\mathbb{R}^m)^k \to \mathbb{R}$ induced on inputs of length $k$ by the possible initialized recurrent
nets with a given architecture $A$. We now describe some results from the papers [24] and [27].
In all our results, we will take the number of input components $m = 1$, for simplicity, and we
consider only homogeneous (all activations equal) architectures. We assume we are working in
discrete time and interpret Equations (9) as difference equations. By a $\sigma$-architecture, we mean
an architecture where all activations are the same function $\sigma : \mathbb{R} \to \mathbb{R}$. (The choice of $m = 1$
makes our lower bounds more interesting. It is fairly easy, though notationally somewhat more
cumbersome, to extend the upper bounds to vector inputs. The same can be said about the
homogeneity assumption.)

Given any $n$-dimensional architecture $A$ with $m = p = 1$, and any $k > 0$, we denote
$\mathrm{vc}(A, k) = \mathrm{vc}(\mathcal{F}_{A,k})$ and refer to this quantity also as the “VC dimension of $A$ when receiving
inputs of length $k$”. We write $r$ for the total number of parameters in the architecture.

We are particularly interested in understanding the behavior of $\mathrm{vc}(A, k)$ as $k \to \infty$, for
various recurrent architectures, as well as the dependence of this quantity on the number of
weights and the particular type of activation being used.

The first case of interest is that in which $\sigma$ is the identity. This means that one is using linear
dynamical systems as learners in the sense of PAC theory (or, for VC dimension, the sign of
the output of such a system, which represents the simplest possible quantization of the output
signal). Comparing with classical “perceptrons”, the maps in $\mathcal{F}_{A,k}$ represent in this case inner
products with a separating vector in $\mathbb{R}^k$ that is the impulse response of a recursive digital filter
of order $n$. Seen in this context, the usual perceptrons are nothing more than the very special
subclass of “finite impulse response” systems (all poles at zero); thus it is appropriate to call
the more general class “recurrent” or “IIR (infinite impulse response)” perceptrons. One may
reliably predict, for given accuracy and confidence parameters, on the basis of $O(k)$ samples,
but this is too conservative if there is reason to believe that the data may be linearly separable
by a low-dimensional dynamical system, that is, if we are interested in learning using the above
hypothesis class and $k \gg n$. Roughly speaking, the main result in [24] was that the number
of samples needed is proportional to the logarithm of the length $k$ (as opposed to $k$ itself, as
would be the case if one did not take advantage of the recurrent structure). The upper bounds
are obtained using simple arguments from algebraic geometry, while the lower bounds involve an
apparently new result on dual VC dimensions, for which we needed to develop the theory. The main
result, for the non-sparse case in which we assume that all entries of the matrices are parameters
(actually, the parameter space is redundant, and one may assume, using appropriate canonical
forms, that $r = 2n$), is, more precisely, as follows. We write $\mathrm{vc}(n, k)$ instead of $\mathrm{vc}(A, k)$, where
$A$ is the architecture being discussed. Then:

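$$\max\left\{ n,\ n \left\lfloor \log\left( \left\lfloor 1 + \frac{k-1}{n} \right\rfloor \right) \right\rfloor \right\} \le \mathrm{vc}(n, k) \le \min\{ k,\ 20n + 4n \log(k - n + 1) \}.$$

Observe that this means, in particular, that when $k > \max\{n^2, 32\}$ it holds that
$\frac{n}{4} \log k \le \mathrm{vc}(n, k) \le 8n \log k$.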
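
The identification of a linear recurrent classifier with a perceptron whose separating vector is an impulse response can be checked on a small example. The sketch below assumes a state-space form $x(t+1) = Ax(t) + Bu(t)$, $x(0) = 0$, with scalar output $Cx(k)$ read after the last input; Equations (9) are not reproduced in this section, so this form, the time-indexing convention, and all names and numerical values are assumptions made for illustration.

```python
def run_linear_system(A, B, C, inputs):
    """x(t+1) = A x(t) + B u(t), x(0) = 0; return the scalar output C x(k)."""
    n = len(A)
    x = [0.0] * n
    for u in inputs:
        x = [sum(A[i][j] * x[j] for j in range(n)) + B[i] * u for i in range(n)]
    return sum(C[i] * x[i] for i in range(n))

def impulse_response(A, B, C, k):
    """h_j = C A^j B for j = 0, ..., k-1."""
    n = len(A)
    v, h = list(B), []
    for _ in range(k):
        h.append(sum(C[i] * v[i] for i in range(n)))
        v = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    return h

def classify(A, B, C, inputs):
    """Recurrent perceptron: 1 (accepted) if the final output is positive, else 0."""
    return 1 if run_linear_system(A, B, C, inputs) > 0 else 0

A = [[0.5, 1.0], [0.0, -0.25]]
B = [1.0, 2.0]
C = [1.0, -1.0]
u = [1.0, -2.0, 0.5, 3.0, -1.0, 0.25]
k = len(u)
h = impulse_response(A, B, C, k)
direct = run_linear_system(A, B, C, u)
# Inner product of the input sequence with the time-reversed impulse response.
inner = sum(h[j] * u[k - 1 - j] for j in range(k))

# A nilpotent A (all poles at zero) gives a finite impulse response.
A_fir = [[0.0, 1.0], [0.0, 0.0]]
h_fir = impulse_response(A_fir, B, C, 5)
```

Here `direct` and `inner` agree, so the classifier is a perceptron on $\mathbb{R}^k$ whose weight vector is constrained to be an impulse response; the separating vector has $k$ entries but is determined by far fewer parameters when $k \gg n$, which is the source of the logarithmic dependence on $k$.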

By a threshold recurrent architecture we mean a homogeneous one with $\sigma = H$. Our main
results in [27], ignoring multiplicative constants, say roughly the following: (1) For architectures
with activation $\sigma$ = any fixed nonlinear polynomial, the VC dimension is $\approx rk$, and the lower
bound holds for any sigmoidal activation (but there are no possible upper bounds that hold for
arbitrary sigmoids). (2) For architectures with activation $\sigma$ = any fixed piecewise polynomial,
the VC dimension is between $rk$ and $r^2 k$. (3) For architectures with activation $\sigma = H$ (threshold
nets), the VC dimension is between $r \log(k/r)$ and $\min\{rk \log rk,\ r^2 + r \log rk\}$. (4) For the
standard sigmoid $\sigma(x) = 1/(1 + e^{-x})$, the VC dimension is between $rk$ and $r^4 k^2$. Upper bounds
are obtained by a combination of “unfolding” (and application of bounds known for feedforward
nets) and ad hoc results. The lower bounds are obtained by means of explicit examples of
architectures which perform various bit-decoding operations on weights. In the sigmoidal case
one uses again the “locally linear and globally threshold” property. It is possible to generalize
the lower bound that holds in the sigmoidal case to even more arbitrary activations. Let $\sigma$
be a function which is twice continuously differentiable in an open interval containing
some point $x_0$ where $\sigma''(x_0) \neq 0$. The VC dimension of recurrent architectures with activation $\sigma$,
with $r$ weights and receiving inputs of length $k$, is also $\Omega(rk)$. The construction in this case is
based on ideas from symbolic dynamics, essentially using a chaotic-type system to decode weight
information and hence affect the progress of the computation.

It is interesting to contrast the situation with the one that holds for feedforward nets. For
the latter, it holds, in general terms, that linear activations provide VC dimension proportional
to $r$, threshold activations give VC dimension proportional to $r \log r$, and piecewise polynomial
activations result in VC dimension proportional to $r^2$.

3.2.3  Other  Dimensions 

As we discussed, for feedforward nets the VC dimension grows in general at least as fast as the
square $r^2$ of the number of adjustable weights $r$. This fact can be seen as optimistic (relatively low
sample complexity), but the bound is more pessimistic than some experimental data would seem
to indicate. It is essentially impossible to design experiments for testing PAC learning, because
of the need to estimate prediction confidence for large families of distributions; most experiments
deal with specific distributions. Because of this, the first PI was asked, at the end of his plenary
talk presentation of the results from [25] and [53] at NIPS'95, if it was not possible that sets of
input patterns which can be shattered are all in some sense “special” and that if we ask instead,
as done in the classical literature in pattern recognition, for the shattering of all sets in “general
position” (as in Cover's work on capacity of perceptrons), then an upper bound of $O(r)$ might
hold. After further research, we produced the paper [26], which answered this in the affirmative.
We established a linear upper bound for arbitrary sigmoidal (as well as threshold) neural nets
for what we may call a “generic shattering dimension”. We omit the details here, but wish to
emphasize that one major direction for future work is that of investigating the relevance of our
generic shattering dimension to variations of PAC learning, when one weakens the requirement
that generalization capabilities must hold with respect to all possible input distributions. (The
upper bound is also useful in the very different context of understanding computational abilities:
as an illustration of this fact, we mention that our main result, announced electronically, was
immediately employed by Maass in a new paper contrasting the computational power of spiking
neurons with that of sigmoidal neural networks.)

3.3  Approximations 

Here we deal with question Q1: “How large can the errors $\mathrm{Err}(\mathcal{F})$ be?”, for hypothesis classes
of the form $\mathcal{F} = \mathcal{F}_A$. Recall that this question characterizes the minimum potentially achievable
error, no matter how many samples are seen or how powerful an algorithm is used. The smaller
$\mathrm{Err}(\mathcal{F})$, the more useful is the estimate in Equation (16). Take the case when the underlying
probability densities in the general learning paradigm are induced from input/output data of
the form $(u, g(u))$, where $g$ is an unknown target function whose behavior we are attempting
to emulate by means of hypotheses $f \in \mathcal{F}$. In that case, we are led to questions of function
approximation. With reasonable prior assumptions on the possible $g$'s, one tries to obtain good
estimates of the best possible approximation error $\inf_{f \in \mathcal{F}} \|g - f\|$, for various function space
norms on $\mathcal{F}$. That is, one wants to study the distance to $\mathcal{F}$ of elements $g$ which satisfy the prior
assumptions.

3.3.1  Feedforward  Nets 

In the neural nets literature, one finds a claim to the effect that approximations by means of (1HL)
neural networks may require fewer parameters than conventional techniques. What is meant by
this is that approximations of functions in certain classes (defined typically in harmonic analysis
terms or, say, the unit ball of a Sobolev space) to within a desired error tolerance can be
obtained using “small” networks. In contrast, the argument goes, using for instance orthogonal
polynomials, splines, or Fourier series would require an astronomical number of terms, especially
for multivariate inputs. Unfortunately, this claim represents a misunderstanding of the very nice
results obtained by Andrew Barron and Lee Jones during the last few years. Their results apply
in principle also to give efficient approximations with various types of classical basis functions as
long as the basis elements can be chosen in a nonlinear fashion (just as is the case with neural
networks). For instance, splines with free (rather than fixed) nodes, or trigonometric series
with adaptively selected frequencies, will have the same properties. What is important is the
possibility of selecting terms adaptively, in contrast to the use of a large basis containing many
terms and fitting these through the use of least squares. Thus, the important fact about these
results is that they emphasize that nonlinear parameterizations may require fewer parameters
than linear ones to achieve a guaranteed degree of approximation. (Abstractly, this is not so
surprising: for an analogy, consider the fact that even a one-parameter analytic curve, based
on ergodic motions, can be used to approximate arbitrarily well every element in a Euclidean
space $\mathbb{R}^d$, but no $(d-1)$-dimensional subspace can do so.) For an exposition and a precise theorem
comparing rates of approximation by such neural net and nonlinear adaptation approaches vs.
rates obtainable with classical approximation techniques, the reader may wish to consult the
paper [3]. (Constraints on nonlinear adaptation procedures also exist, and are based on the
theory of nonlinear $N$-widths, but we do not discuss that subject here.) The paper [21] dealt
with some of these questions regarding rates of approximation, and we review some of the setup
now.

The subject of that paper concerns the problem of approximating elements of a Banach space
$X$ - typically presented as a space of functions - by means of finite linear combinations of
elements from a predetermined subset $S$ of $X$. In contrast to classical linear approximation
techniques, where optimal approximation is desired and no penalty is imposed on the number
of elements used, we are interested there in sparse approximants, that is to say, combinations
that employ few elements. In particular, we are interested in understanding the rate at which
the achievable error can be reduced as one increases the number allowed. Such questions are of
obvious interest in areas such as signal representation, numerical analysis, and neural networks.
In that latter context, if we wish to study approximations by input/output maps of 1HL networks
with $n$ hidden units, one must consider linear combinations of $n$-element subsets of

$$S = \{\, g : \mathbb{R}^m \to \mathbb{R} \mid \exists A \in \mathbb{R}^m,\ a \in \mathbb{R},\ g(u) = \pm\sigma(A \cdot u + a) \,\},$$

where $\sigma : \mathbb{R} \to \mathbb{R}$ is the activation of interest. 1HL approximations are of interest because of the
results which ensure that, for each compact subset $M$ of $\mathbb{R}^m$, restricting elements of $S$ to $M$, the
closed linear span of $S$ is all of $C^0(M)$ (under extremely weak conditions on $\sigma$: being locally
Riemann integrable and non-polynomial is enough).
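
As an illustration of the set $S$ above, the following minimal Python sketch builds elements $g(u) = \pm\sigma(A \cdot u + a)$ and forms the input/output map of a 1HL network as a linear combination of $n$ of them. The names `make_unit` and `one_hidden_layer` are hypothetical, and the choice $\sigma = \tanh$ is only an assumption made for the example; the text allows any admissible activation.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
ScalarMap = Callable[[Vector], float]


def make_unit(A: Vector, a: float, sign: int = 1,
              sigma: Callable[[float], float] = math.tanh) -> ScalarMap:
    """Return one element g of S: g(u) = sign * sigma(A . u + a), sign in {+1, -1}."""
    if sign not in (1, -1):
        raise ValueError("sign must be +1 or -1")

    def g(u: Vector) -> float:
        return sign * sigma(sum(Ai * ui for Ai, ui in zip(A, u)) + a)

    return g


def one_hidden_layer(coeffs: Sequence[float], units: List[ScalarMap]) -> ScalarMap:
    """Input/output map u -> sum_i c_i g_i(u) of a net with len(units) hidden units."""
    if len(coeffs) != len(units):
        raise ValueError("need exactly one coefficient per hidden unit")

    def f(u: Vector) -> float:
        return sum(c * g(u) for c, g in zip(coeffs, units))

    return f


# Example (assumed values): inputs in R^2 (m = 2) and two hidden units (n = 2).
g1 = make_unit([1.0, 0.0], 0.0)
g2 = make_unit([0.0, 1.0], 0.0, sign=-1)
net = one_hidden_layer([2.0, 3.0], [g1, g2])
```

Here `net(u)` evaluates $2\tanh(u_1) - 3\tanh(u_2)$; the free parameters are the pairs $(A, a)$ inside each unit (the nonlinear part) and the outer coefficients (the linear part).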

Rather than arbitrary linear combinations $\sum_i c_i g_i$ with $c_i$’s real and $g_i$’s in $S$, it turns out to
be easier to understand approximations in terms of combinations that are subject to a prescribed
upper bound on the total coefficient sum $\sum_i |c_i|$. After normalizing $S$ and replacing it by $S \cup -S$,
one is led to studying approximations in terms of convex combinations. This is the focus of [21].
To explain the previous results and new contributions, we first introduce some notation. Let $X$ be
a Banach space, with norm $\|\cdot\|$. Take any subset $S \subseteq X$. For each positive integer $n$, we let $\mathrm{co}_n S$
consist of all sums $\sum_{i=1}^{n} c_i g_i$, with $g_1, \ldots, g_n$ in $S$ and reals $c_i \in [0,1]$, $\sum_{i=1}^{n} c_i = 1$. The distance
from an element $f \in X$ to this space is denoted $\|\mathrm{co}_n S - f\| := \inf\{\|h - f\|,\ h \in \mathrm{co}_n S\}$. Let $\phi$
be a positive function on the integers. We say that the space $X$ admits a (convex) approximation
rate $\phi(n)$ if for each bounded subset $S$ of $X$ and each $f \in \mathrm{co}\,S$, $\|\mathrm{co}_n S - f\| = O(\phi(n))$. Jones
and Barron showed that every Hilbert space admits an approximation rate $\phi(n) = 1/\sqrt{n}$. One
of our main objectives in [21] was the study of such rates for non-Hilbert spaces. (Barron in
1992 did show that the same rate is obtained in the uniform norm, but only for approximation
with respect to special classes of sets $S$.) Spaces $L^p$ with $p$ equal to or slightly greater than one
are particularly important because of their usefulness for robust estimation, and there have been
experimental results for regression with neural networks, showing the superiority of $L^p$ ($p < 2$)
to $L^2$ in that context (see [21]).
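
The $1/\sqrt{n}$ rate in the Hilbert space case can be checked numerically on a toy example. The sketch below works in $\mathbb{R}^2$ with the Euclidean norm and a four-element set $S$; both are assumptions made for illustration. It uses the standard probabilistic averaging argument, which is not necessarily the proof used in the works cited above: if $f = \sum_i p_i g_i$, then an average of $n$ elements drawn independently according to the weights $p_i$ has mean squared error $(\sum_i p_i \|g_i\|^2 - \|f\|^2)/n$, so some equal-weight average of $n$ elements of $S$ is at least that close to $f$. The function names are hypothetical.

```python
import itertools
import math


def convex_target(S, weights):
    """Return f = sum_i p_i g_i for points g_i in R^d and convex weights p_i."""
    d = len(S[0])
    return tuple(sum(p * g[k] for p, g in zip(weights, S)) for k in range(d))


def averaging_bound(S, weights, n):
    """sqrt((sum_i p_i ||g_i||^2 - ||f||^2) / n): root mean squared error of an
    n-sample average, hence an upper bound on the best n-term error."""
    f = convex_target(S, weights)
    second_moment = sum(p * sum(x * x for x in g) for p, g in zip(weights, S))
    variance = second_moment - sum(x * x for x in f)
    return math.sqrt(max(variance, 0.0) / n)


def best_equal_weight_error(S, f, n):
    """Smallest ||avg - f|| over equal-weight averages of n elements of S
    (repetitions allowed); every such average lies in co_n S."""
    d = len(f)
    best = math.inf
    for combo in itertools.combinations_with_replacement(S, n):
        avg = tuple(sum(g[k] for g in combo) / n for k in range(d))
        best = min(best, math.dist(avg, f))
    return best


# Toy data (assumed): four unit vectors in the plane and a target in their convex hull.
S = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
weights = [0.4, 0.3, 0.2, 0.1]
f = convex_target(S, weights)
```

For each $n$, `best_equal_weight_error(S, f, n)` is at most `averaging_bound(S, weights, n)`, which decays like $1/\sqrt{n}$. The exhaustive search is only practical for very small $S$ and $n$.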

Another issue of interest is as follows. Jones considered the procedure of constructing
approximants to $f$ incrementally, by forming a convex combination of the last approximant with a
single new element of $S$; in this case, the convergence rate in $L^2$ is interestingly again $O(1/\sqrt{n})$.
Incremental approximants are especially attractive from a computational point of view. In the
neural network context, they correspond to adding one “neuron” at a time to decrease the
residual error. We next define this concept precisely. Again let $X$ be a Banach space with norm
$\|\cdot\|$. Let $S \subseteq X$. An incremental sequence (for approximation in $\mathrm{co}\,S$) is any sequence
$f_1, f_2, \ldots$ of elements of $X$ so that $f_1 \in S$ and for each $n \ge 1$ there is some $g_n \in S$ so that
$f_{n+1} \in \mathrm{co}(\{f_n, g_n\})$. We say that an incremental sequence $f_1, f_2, \ldots$ is greedy (with respect to
$f \in \mathrm{co}\,S$) if $\|f_{n+1} - f\| = \inf\{\|h - f\| \mid h \in \mathrm{co}(\{f_n, g\}),\ g \in S\}$, $n = 1, 2, \ldots$. The set $S$ is
generally not compact, so we cannot expect the infimum to be attained. Given a positive sequence
$\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots)$ of allowed “slack” terms, we say that an incremental sequence $f_1, f_2, \ldots$ is $\varepsilon$-greedy
(with respect to $f$) if $\|f_{n+1} - f\| < \inf\{\|h - f\| \mid h \in \mathrm{co}(\{f_n, g\}),\ g \in S\} + \varepsilon_n$, $n = 1, 2, \ldots$.

Let $\phi$ be a positive function on the integers. We say that $S$ has an incremental (convex) scheme
with rate $\phi(n)$ if there is an incremental schedule $\varepsilon$ such that, for each $f$ in $\mathrm{co}\,S$ and each $\varepsilon$-greedy
incremental sequence $f_1, f_2, \ldots$, it holds that $\|f_n - f\| = O(\phi(n))$ as $n \to +\infty$. Finally, we say
that the space $X$ admits incremental (convex) schemes with rate $\phi(n)$ if every bounded subset $S$
of $X$ has an incremental scheme with rate $\phi(n)$. The intuitive idea behind this definition is that
at each stage we attempt to obtain the best approximant in the restricted subclass consisting
of convex combinations $(1 - \lambda_n) f_n + \lambda_n g$, with $\lambda_n$ in $[0,1]$, $g$ in $S$, and $f_n$ being the previous
approximant. It is also possible to select the sequence $\lambda_1, \lambda_2, \ldots$ beforehand. We say that an
incremental sequence $f_1, f_2, \ldots$ is $\varepsilon$-greedy (with respect to $f$) with convexity schedule $\lambda_1, \lambda_2, \ldots$
if $\|f_{n+1} - f\| < \inf\{\|((1 - \lambda_n) f_n + \lambda_n g) - f\| \mid g \in S\} + \varepsilon_n$, $n = 1, 2, \ldots$.
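
The incremental construction with a fixed convexity schedule can be sketched as follows. This is a minimal illustration under stated assumptions, not the general Banach space setting of [21]: the space is $\mathbb{R}^2$ with the Euclidean norm, $S$ is a small finite set (so the infimum is attained and no slack $\varepsilon_n$ is needed), and the schedule is $\lambda_n = 1/(n+1)$. The name `incremental_approximants` is hypothetical.

```python
import math


def incremental_approximants(S, f, steps):
    """Build f_1, ..., f_steps with f_1 in S and
    f_{n+1} = (1 - lam_n) f_n + lam_n g, lam_n = 1/(n+1),
    where g in S is chosen at each step to minimize the distance to f.
    Returns the last approximant and the list of errors ||f_n - f||."""
    def combine(lam, x, g):
        return tuple((1.0 - lam) * xi + lam * gi for xi, gi in zip(x, g))

    current = min(S, key=lambda g: math.dist(g, f))  # f_1: element of S closest to f
    errors = [math.dist(current, f)]
    for n in range(1, steps):
        lam = 1.0 / (n + 1)
        candidates = (combine(lam, current, g) for g in S)
        current = min(candidates, key=lambda h: math.dist(h, f))
        errors.append(math.dist(current, f))
    return current, errors


# Toy data (assumed): four unit vectors and a target in their convex hull.
S = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
f = (0.2, 0.2)  # = 0.4*S[0] + 0.3*S[1] + 0.2*S[2] + 0.1*S[3]
approximant, errors = incremental_approximants(S, f, 50)
```

In this Euclidean example one can check by induction that the error after $n$ steps is at most $\sqrt{V/n}$, where $V = \sum_i p_i \|g_i\|^2 - \|f\|^2$ (here $V = 0.92$): the greedy choice does at least as well as the average over $g$ drawn with the weights $p_i$, which gives $e_{n+1}^2 \le (1 - \lambda_n)^2 e_n^2 + \lambda_n^2 V$. With $\lambda_n = 1/(n+1)$ the approximant is a simple running average of the selected elements.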

The main objective of the paper [21] was to analyze both optimal and incremental rates
in broad classes of Banach spaces, specifically including $L^p$, $1 \le p \le \infty$. It is a triviality
that optimal approximants to approximable functions always converge. However, the rates of
convergence depend critically upon the structure of the space. In some spaces, like $L^1$, there
exist target functions for which the rate can be made arbitrarily slow. In Banach spaces of
(Rademacher) type $t$ with $t > 1$, however, a rate bound of $O(n^{-1+1/t})$ is obtained. For $L^p$ spaces
these results specialize to these (“NO” means that the approximants do not always converge):


p              1      (1,2)           [2,∞)        ∞
optimal        1      n^{-1+1/p}      n^{-1/2}     1
incremental    NO     n^{-1+1/p}      n^{-1/2}     NO

Particular examples of $L^p$ spaces are given to show that the orders given in our bounds
cannot in general be sharpened. In the incremental case, a particularly interesting aspect of the
results is that the new element of $S$ added to the incremental approximant is not required to
be the best possible choice. Instead, the new element can meet a less stringent test, and the
convex combination of the elements included in the approximant is not optimized. Instead a
simple average is used. (This is an example of a fixed convexity schedule.) Thus, our incremental
approximants are the simplest yet studied, simpler even than those of Jones. Nonetheless, the
same worst-case order is obtained for these approximants on $L^p$, $1 < p < \infty$, as for the optimal
approximant. In more general spaces, the incremental approximants may not even converge.
However, if the space has a modulus of smoothness of power type greater than one, or is of
Rademacher type $t$, then rate bounds can be given. The results are based on explicit constructions
as well as use of the rich theory developed by Lindenstrauss on smoothness moduli and by Ledoux,
Talagrand, and others on probability in Banach spaces.

3.4  Recurrent  Nets

One can ask similar approximation rate questions for recurrent nets. Recurrent nets provide
universal identification models, in the sense that continuous- or discrete-time systems $\Sigma$ of the type
$\dot{x}$ [or $x^+$] $= f(x, u)$, $y = h(x)$ (under standard smoothness assumptions), can be approximately
simulated by the behavior of a recurrent network. This was explained in [37]. By approximate
simulation we mean the following. In general, assume given two systems $\Sigma$ and $\widetilde{\Sigma}$ as above, where
tildes denote data associated to the second system, and with the same number of inputs and outputs
(but possibly $\tilde{n} \ne n$). Suppose also given compact subsets $K_1 \subseteq \mathbb{R}^n$ and $K_2 \subseteq \mathbb{R}^m$, as well
as an $\varepsilon > 0$ and a $T > 0$. Suppose further (this simplifies definitions, but can be relaxed)
that for each initial state $x_0 \in K_1$ and each measurable control $u(\cdot) : [0,T] \to K_2$ the solution
$\phi(t, x_0, u)$ is defined for all $t \in [0,T]$. The system $\widetilde{\Sigma}$ simulates $\Sigma$ on the sets $K_1, K_2$ in time $T$
and up to accuracy $\varepsilon$ if there exist two continuous mappings $\alpha : \mathbb{R}^{\tilde{n}} \to \mathbb{R}^n$ and $\beta : \mathbb{R}^n \to \mathbb{R}^{\tilde{n}}$
so that the following property holds: For each $x_0 \in K_1$ and each $u(\cdot) : [0,T] \to K_2$, denote
$x(t) := \phi(t, x_0, u)$ and $\tilde{x}(t) := \tilde{\phi}(t, \beta(x_0), u)$; then this second function is defined for all $t \in [0,T]$,
$\|x(t) - \alpha(\tilde{x}(t))\| < \varepsilon$, and $\|h(x(t)) - \tilde{h}(\tilde{x}(t))\| < \varepsilon$ for all such $t$. Assume that $\sigma$ is a universal
activation (dilates and translates of $\sigma$ are dense on continuous functions with the compact-open
topology). Then, for each system $\Sigma$ and for each $K_1, K_2, \varepsilon, T$ as above, there is a $\sigma$-system
$\widetilde{\Sigma}$ that simulates $\Sigma$ on the sets $K_1, K_2$ in time $T$ and up to accuracy $\varepsilon$. Thus, recurrent nets
approximate a wide class of nonlinear plants. Note, however, that approximations are only valid
on compact subsets of the state space and for finite time, so that many interesting dynamical
characteristics are not reflected. This is analogous to the role of bilinear systems, shown to be
universal in analogous fashion in Sussmann’s well-known 1975 paper. As with bilinear systems,
it is obvious that if one imposes extra stability assumptions (“fading memory” type) it will be
possible to obtain global approximations, but this is probably not very useful, as stability is often
a goal of control rather than an assumption.
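
The simulation property above can be made concrete with a small discrete-time sketch. The form assumed here for a $\sigma$-system, $x^+ = \sigma(Ax + Bu)$ applied componentwise with $y = Cx$, is the usual recurrent net model but is not spelled out in this section, so it should be read as an assumption; the function names are hypothetical. The second function checks the two inequalities of the definition only along the sampled trajectories it is given, which is a necessary check on those trajectories and not a proof of simulation on all of $K_1$, $K_2$.

```python
import math


def mat_vec(M, v):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]


def run_sigma_system(A, B, C, x0, inputs, sigma=math.tanh):
    """Iterate x+ = sigma(A x + B u) (componentwise), y = C x.
    Returns the lists of states x(0..T) and outputs y(0..T)."""
    x = list(x0)
    states, outputs = [x], [mat_vec(C, x)]
    for u in inputs:
        pre = [p + q for p, q in zip(mat_vec(A, x), mat_vec(B, u))]
        x = [sigma(z) for z in pre]
        states.append(x)
        outputs.append(mat_vec(C, x))
    return states, outputs


def within_accuracy(states, sim_states, outputs, sim_outputs, alpha, eps):
    """Check ||x(t) - alpha(x~(t))|| < eps and ||y(t) - y~(t)|| < eps at every step."""
    return all(
        math.dist(x, alpha(xs)) < eps and math.dist(y, ys) < eps
        for x, xs, y, ys in zip(states, sim_states, outputs, sim_outputs)
    )


# A one-dimensional system and a two-dimensional system carrying an extra idle state.
controls = [[1.0], [0.0], [-0.5]]
x0 = [0.0]
beta = lambda x: [x[0], 0.0]   # embeds R^1 into R^2
alpha = lambda xs: xs[:1]      # projects R^2 back onto R^1
states, outputs = run_sigma_system([[0.5]], [[1.0]], [[2.0]], x0, controls)
sim_states, sim_outputs = run_sigma_system(
    [[0.5, 0.0], [0.0, 0.0]], [[1.0], [0.0]], [[2.0, 0.0]], beta(x0), controls)
```

In this toy case the larger system reproduces the smaller one exactly, so `within_accuracy(...)` holds for any positive `eps`; in the setting of the text the two systems differ and the accuracy $\varepsilon$ is attained only on compact sets and finite time.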

3.5  Computing  Risk  Minimizer 

Here we deal with question Q3: “Is it computationally feasible to find $\mathrm{Emp}(\nu)$?”, for hypothesis
classes of the form $\mathcal{F}$ = T^. That is to say, given a cost function involving fitting an architecture
to training data, what are the theoretical limitations of finding the value of the least error, and/or
actually finding the minimizing parameters?

The approach we took in [20] originated with the work of Judd, Blum and Rivest, Lin and
Vitter, and others. Judd observed that gradient descent techniques used in practice for the 1HL
problem seem to be subject to a “curse of dimensionality”. For the simpler case of linearly
separable data, the perceptron algorithm and linear programming techniques help find a network
- with no “hidden units” - relatively fast. Thus one may ask if there exists a fundamental barrier
to training general feedforward networks, a barrier that is insurmountable no matter which
particular algorithm one uses. The simplest version of this is to ask about the tractability of the
training problem, that is, of the question: “Can we determine if $\mathrm{Emp}(\nu) = 0$?” or in equivalent
terms: “Given a network architecture (interconnection graph as well as choice of activation
function) and a set of training examples, does there exist a set of weights so that the network
produces the correct output for all examples?”
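
For the linearly separable case mentioned above, the training question can be answered constructively. The sketch below is a plain implementation of the classical perceptron update rule for a single threshold unit with labels in $\{+1, -1\}$; the function names and the epoch cap `max_epochs` are assumptions introduced for the example. When the data are linearly separable the loop terminates with weights that fit every example; a return value of `None` only means that no such weights were found within the cap, not that none exist.

```python
def train_perceptron(examples, max_epochs=1000):
    """examples: list of (x, label) pairs with label in {+1, -1}.
    Returns (w, b) with label * (w . x + b) > 0 on every example,
    or None if no such weights were found within max_epochs passes."""
    dim = len(examples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in examples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if label * activation <= 0:  # misclassified (or on the boundary)
                w = [wi + label * xi for wi, xi in zip(w, x)]
                b += label
                mistakes += 1
        if mistakes == 0:
            return w, b
    return None


def threshold_output(w, b, x):
    """Output of a single hard-threshold neuron."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Boolean AND is linearly separable; XOR is not.
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
xor_data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]
and_weights = train_perceptron(and_data)
xor_weights = train_perceptron(xor_data)
```

On the AND data the routine returns separating weights, while on XOR it returns `None`, since no single threshold unit computes XOR. The hardness results discussed in this section concern networks with hidden units, where no comparably simple procedure is known.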

The simplest neural network, i.e., the perceptron, is a network that consists of one threshold
neuron only. It is easily verified that the computational time of the loading problem in this
case is polynomial in the size of the training set irrespective of whether the input is analog or
binary. This can be achieved via a linear programming technique. We showed that, for 1HL
networks employing a simple, piecewise linear activation function, and just two hidden units, the
training problem is NP-complete. Recall that if one establishes that a problem is NP-complete
then one has shown, in the standard way done in computer science, that the problem is at least
as hard as most problems widely believed to be hard (the “traveling salesman” problem, Boolean
satisfiability, and so forth). This shows that, indeed, any possible neural net learning algorithm
based on fixed architectures faces severe computational barriers. Our results generalized those of
Blum and Rivest, which only proved a similar NP-completeness conclusion for networks having
the same architecture but differing from ours in that the activation functions are all of a hard
threshold type.

1  Objectives

This  project  focuses  on  fundamental  theoretical  issues  relevant  to  the  capabilities,  performance,  and 
limitations  of  artificial  neural  networks. 

2  Status  of  effort 

Work continued on all major subareas of the project, and substantial progress was achieved in many of
them.¹

Among  the  main  new  contributions  are  the  derivation  of  tight  bounds  for  the  sample  complexity  of 
learning  recurrent  nets,  both  in  continuous  and  discrete  time,  the  start  of  a  computational  complex¬ 
ity  study  of  hybrid  system  equivalence,  and  the  complete  characterization  of  reachability  for  states  of 
recurrent  nets. 


3  Accomplishments/New  Findings 

The object of this project is to explore theoretical issues regarding capabilities, performance, and
limitations of “neural” and other alternative models of computation and control. Much research carried out
by AFOSR contractors and Air Force labs concerns the use of neural network techniques when
dealing, among others, with problems of fault detection and classification (in particular, for reconfigurable
aircraft), design of controls valid over large flight envelopes, and precision laying of composites on
aircraft structures. Unquestionably, there is a demand for basic theoretical foundations for the analysis and
comparison of different models and algorithms. This work concerns the development of such foundations.


3.1  Identification 

The  question  of  “identification”  or  “learning”  has  to  do  with  the  development  of  algorithms,  and  basic 
associated  theoretical  notions,  for  the  fitting  of  models,  in  particular  to  time-structured  data.  Typically, 
there  is  a  “training”  phase,  in  which  a  learning  system  is  presented  with  a  representative  sample  of 
labeled  input/output  pairs,  and  the  goal  is  to  form  associations  which  result  in  good  responses  for  future 
inputs,  including  of  course  inputs  which  were  not  seen  exactly  during  the  training  period.  Most  of  our 
work  deals  with  the  PAC  (or  uniform-convergence  of  empirical  means)  estimates,  pioneered  by  Vapnik, 
Valiant,  and  others.  In  particular,  we  have  concentrated  on  estimates  of  sample  complexity  (amount  of 
data  needed  for  reliable  generalization). 

Following  up  on  our  previous  breakthroughs  on  lower  bounds  for  VC  dimension  of  feedforward  nets  (2, 
4),  we  have  now  obtained  tight  bounds  for  recurrent  networks  as  well  (see  9,  10).  Ignoring  multiplicative 
constants,  the  main  results  say  roughly  the  following:  (1)  For  architectures  with  activation  any  fixed 
nonlinear  polynomial,  the  VC  dimension  is  proportional  to  both  the  number  of  parameters  in  the  model 
and  the  length  of  inputs  in  a  given  window;  (2)  For  architectures  whose  activation  function  is  any 
fixed piecewise polynomial, the VC dimension is bounded by the square of the number of parameters
multiplied  by  the  length  of  inputs;  (3)  For  architectures  whose  activation  function  is  the  standard  sigmoidal 
tanh,  we  have  upper  bounds  logarithmic  in  input  length  and  of  fourth  order  in  number  of  parameters  (but 
lower  bound  is  only  quadratic  in  number  of  parameters,  and  one  of  our  research  directions  is  to  narrow 
this  gap).  These  bounds  should  form  the  basis  of  an  approach  to  learning  time-series  data,  and  during 
our  lectures  and  visit  to  the  special  semester  at  the  Newton  Institute,  this  direction  for  further  research 
was discussed with several of the leading researchers in the area. Two students have been working on
this project, and on its relations to systems identification.

¹ The Web site http://www.math.rutgers.edu/~sontag/ can be consulted for up-to-date versions of (p)reprints.

3.2  Dynamics  of  Recurrent  Nets,  and  Systems  with  Saturations

A  major  success  was  the  work  5,  in  which  we  were  able  to  provide  a  characterization  of  transitivity 
(complete  controllability)  for  recurrent  nets.  Follow-up  work  is  trying  to  weaken  the  assumptions,  hence 
enlarging  the  class  of  models  to  which  the  result  applies.  (In  particular,  current  work  has  resulted  in  an 
extension  of  the  results  to  the  model,  also  of  great  interest,  in  which  the  right-hand  side  depends  linearly 
on  sigmoidal  values;  a  paper  with  a  graduate  student  is  in  preparation.) 

The paper 3 provided a general result regarding stabilization of linear systems with saturated
actuators, in the form of feedforward nets, extending our previously-reported results for the continuous
case. We are now studying (with Z. Lin) extensions of these techniques to account for disturbances,
concentrating first on the continuous-time case.

The  question  of  stability  of  recurrent  nets  is  still  very  much  open,  and  is  the  subject  of  current  efforts, 
including  work  by  graduate  students  and  postdocs  partially  supported  by  the  project.  We  have  had 
preliminary  success  with  the  two-dimensional  problem,  but  even  that  case  has  so  far  defied  a  complete 
solution. 


3.3  Hybrid  Systems 

Systems  combining  discontinuous  and  continuous  components  are  becoming  the  object  of  serious  interest 
in  the  context  of  control  and  computing.  We  have  continued  our  work  on  analog  computing  and  on 
piecewise-linear  systems,  and  in  particular  we  have  just  obtained,  with  B.  Dasgupta,  a  polynomial  time 
algorithm to check equivalence (a paper will be prepared for submission during the next few months).


4  Personnel  Supported 

Partial summer support for: PIs, graduate students (P. Kuusela, M. Krichman, M. Hasson, Y. Qiao, and
S. Koskie), and postdocs (B. Dasgupta and Z. Lin).


5  Publications  (all  with  Sontag  as  coauthor) 

List  of  peer-reviewed  publications  appeared,  submitted,  and/or  accepted  during  period  1  Oct  1996  to  30 
Jun  1997. 

1. (with C. Darken, M. Donahue, and L. Gurvits) “Rates of convex approximation in non-Hilbert
spaces,” Constructive Approximation 13(1997): 187-220.

2.  (with  B.  Dasgupta)  “Sample  complexity  for  learning  recurrent  perceptron  mappings,”  IEEE  Trans. 
Inform.  Theory  42  (1996):  1479-1487. 

3.  (with  H.J.  Sussmann  and  Y.  Yang)  “Global  stabilization  of  linear  discrete- time  systems  with 
bounded  feedback,”  Systems  and  Control  Letters,  30  (1997):  273-281. 

4.  (with  P.  Koiran)  “Neural  networks  with  quadratic  VC  dimension,”  J.  Comp.  Syst.  Sci.  54(1997): 
190-198. 

5.  (with  H.J.  Sussmann)  “Complete  controllability  of  continuous-time  recurrent  neural  networks,” 
Systems  and  Control  Letters  30(1997):  177-183. 

6. “Shattering all sets of k points in ‘general position’ requires (k - 1)/2 parameters,” Neural
Computation 9(1997): 337-348.

7. “Some learning and systems-theoretic questions regarding recurrent neural networks,” in Proc.
Conf.  on  Information  Sciences  and  Systems  (CISS  97),  Johns  Hopkins,  Baltimore,  MD,  March 
1997,  pp.  630-635. 

8. (with R. Koplon) “Using Fourier-neural recurrent networks to fit sequential input/output data,”
Neurocomputing  15(1997):  225-248. 

9.  (with  P.  Koiran)  “Vapnik-Chervonenkis  dimension  of  recurrent  neural  networks,”  Discrete  Applied 
Math.,  to  appear. 

10.  (with  P.  Koiran)  “Vapnik-Chervonenkis  dimension  of  recurrent  neural  networks,”  in  Proc.  Third 
European Conf. Computational Learning Theory, Jerusalem, March 1997, to appear.


6  Interactions 

Major  interactions  (Sontag  unless  otherwise  stated)  included: 

• 1997 Conference on Information Sciences and Systems, Johns Hopkins University, 3/97 (2 talks: “A
remark on robust stabilization of general asymptotically controllable systems” and “Some learning
and systems-theoretic questions regarding recurrent neural networks”)

• Neural Information Processing conference, Snowmass, CO, 12/97. (1-hour talk in workshop “Systems
Theory and Nonlinear Dynamics in Neural Networks”: recurrent networks, and 1/2-hour talk
in workshop “Modeling Error Surfaces”: local minima in feedforward nets)

•  UC  Santa  Cruz,  5/97  (Colloquium:  hybrid  techniques  in  control) 

•  UC  Santa  Barbara,  5/97  (2-hour  talk:  input  to  state  stability  and  related  notions,  and  also  1-hour 
talk: recurrent neural networks).

•  Princeton  Applied  Mathematics  3/97  (Colloquium:  recurrent  neural  networks) 

•  Princeton  Mechanical  and  Aerospace  Engineering  2/97  (Colloquium:  nonlinear  feedback  design) 

• Isaac Newton Institute programme on “Neural Networks and Machine Learning”, Cambridge
University, August 1997 (3 hours of lectures: VC dimension and learning theory)

• AMS 1997 Summer Research Institute “Differential Geometry and Control”, Boulder, June/July
1997 (1-hr invited lecture: nonlinear stability)

• Applications and Science of Artificial Neural Networks Conference, SPIE’s 1997 International
Symposium on Aerospace/Defense Sensing and Controls, Orlando, April 1997 (expository talk:
recurrent neural networks)

• SUNY-Stony Brook Engineering College Seminar, 3/97 (Colloquium: feedback stabilization of
nonlinear systems)

•  (Sussmann:)  Kyoto  University,  12/96  (colloquium  talks) 

•  (Sussmann:)  35th  IEEE  Conference  on  Decision  and  Control  (CDC),  Kobe,  12/96. 

•  (Sussmann:)  Tokyo  University,  12/96  (Colloquium  talk  on  recurrent  networks) 

•  (Sussmann:)  Weizmann  Institute,  Rehovot,  Israel,  12/96. 

•  (Sussmann:)  Technion,  Haifa,  Israel,  12/96,  Conference  “Singularities  and  Control  Theory”. 

•  (Sussmann:)  University  of  Florida,  3/97,  IFIP  Conference  on  Optimal  Control:  Theory,  Algorithms 
and  Applications  (article  in  SIAM  News  described  Sussmann’s  talk). 

• (Sussmann:) INSA (Institut National de Sciences Appliques), Rouen, France, 5-6/97, various
seminar talks.

• (Sussmann:) Summer Research Institute of the American Mathematical Society, on “Differential
Geometry and Control”, Boulder, 7/97, co-organizer and lecturer.

Other relevant conference-related activities: Program Committee, Int. Workshop on Hybrid and
Real-Time Systems, Grenoble, March 1997.

Current Editorships (Sontag): Associate Editor, Journal of Computer and Systems Sciences;
Associate Editor, Neural Networks; Associate Editor, Neurocomputing; Member of Board of Advisors, Neural
Computing Surveys; Associate Editor at Large, IEEE Transactions on Automatic Control; Associate
Editor, Dynamics and Control; Associate Editor, Control: Theory and Advanced Technology, Mita Press;
Associate Editor, SMAI Electronic Journal on Control, Optimization and the Calculus of Variations;
Co-Editor of book series, Communications and Control Engineering; (Co-)Editor, Mathematics of Control,
Signals, and Systems.

7  New  discoveries,  inventions,  or  patent  disclosures 

None. 

8  Honors/Awards

None. 